Closed. tobias-endres closed this issue 3 years ago.
The HTTP Collector's focus is definitely more "vertical". There has been very little demand for horizontal scalability, but I think that is starting to change. That said, a few approaches have worked for different users.
Distributed crawl store
You can have your own implementation of ICrawlDataStoreFactory
(or use an existing one, such as the MongoDB implementation) that can be scaled horizontally. Then you should be able to have multiple crawlers sharing the same config. You may hit a few concurrency issues, but hopefully those will not be a concern.
Dynamic config generation
I have also seen people create their configuration dynamically, distributing the sites to crawl across multiple collectors/crawlers.
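The dynamic-config approach can be sketched as a deterministic partitioner: each start URL is assigned to exactly one crawler instance, so the generated per-instance configs never overlap. This is a minimal illustration only; the class and method names below are hypothetical and not part of the Norconex API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Norconex class): splits seed URLs across
// N crawler instances so each generated config crawls a disjoint subset.
public class SitePartitioner {

    // Stable assignment: the same URL always maps to the same instance index.
    public static int assign(String url, int instanceCount) {
        return Math.floorMod(url.hashCode(), instanceCount);
    }

    // The subset of seed URLs that instance `index` should crawl; a config
    // generator would emit one crawler config per instance from this list.
    public static List<String> seedsFor(
            List<String> allSeeds, int index, int instanceCount) {
        List<String> mine = new ArrayList<>();
        for (String url : allSeeds) {
            if (assign(url, instanceCount) == index) {
                mine.add(url);
            }
        }
        return mine;
    }
}
```

Because the assignment is a pure function of the URL, instances can be restarted (or the config regenerated) without sites migrating between crawlers, as long as the instance count stays the same.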
If you find other approaches, I invite you to share them.
The next major release should make it easier to perform horizontal scaling.
I had closed this issue by accident. Re-opening.
It's an interesting topic, so I'd like to participate :) Regarding the distributed crawl store:
Then you should be able to have multiple crawlers sharing the same config.
If I'm not mistaken, a distributed store (like MongoDB or NFS) would not make crawling efficient if all the collectors share the same config: the same sites could potentially be crawled as many times as there are collector instances.
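Whether that duplication actually happens depends on how the shared store hands out work: if each URL "claim" is atomic, only one collector instance processes it. The sketch below illustrates the idea with an in-memory set; it is not Norconex's implementation, and a real distributed store (e.g. MongoDB) would use an atomic update on the database instead.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a shared crawl queue where claiming a URL is atomic.
// Set.add() returns true only for the first caller, so even with many
// collector instances, each URL is crawled at most once.
public class SharedCrawlQueue {

    private final Set<String> claimed = ConcurrentHashMap.newKeySet();

    // First worker to claim the URL gets true; all others get false and skip it.
    public boolean tryClaim(String url) {
        return claimed.add(url);
    }
}
```

Without such an atomic claim (e.g. plain files on NFS with no locking), the duplicate-crawling concern above is exactly what you would see.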
The next major release should make it easier to perform horizontal scaling.
Could you please share some details on that? Thanks!
Hello @jetnet. Nothing is cast in stone for V3 yet, and we welcome suggestions. One thing is for sure: we would like to make it easier for developers to integrate it into other systems that can scale horizontally. "Vertical" crawling and simplicity will remain a key focus of version 3, but we are considering a few ideas. For example, having each important component (collector, importer, committer) runnable on its own as a server, to facilitate reuse and scalability.
The crawl store will be reworked as well, but in the end, it probably won't be used the way you describe. It can be distributed, but for a single large crawl "session". Different crawler configurations/sessions should stick to their own crawl store "space" (distributed or not) for better results. Implementors can have different crawlers crawl some of the same content but extract/filter things differently for whatever reason, so those crawlers should not overlap.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello Norconex-Team,
I am wondering whether the http-collector can scale horizontally, meaning: is it possible to start multiple crawlers with the same configuration working on the same data? I am looking into running the collector on k8s and would love to be able to dynamically scale the crawler process on demand.
Best regards Tobias