Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Can the http-collector scale horizontally? #631

Closed tobias-endres closed 3 years ago

tobias-endres commented 5 years ago

Hello Norconex-Team,

I am wondering whether the http-collector can scale horizontally, meaning whether it is possible to start multiple crawlers with the same configuration working on the same data. I am looking into running the collector on k8s and would love the possibility to dynamically scale the crawler process on demand.

Best regards Tobias

essiembre commented 5 years ago

The HTTP Collector's focus is definitely more "vertical". There has been very little demand for horizontal scalability, but I think that is starting to change. That being said, there are a few approaches that have worked for different users.

Distributed crawl store. You can provide your own implementation of ICrawlDataStoreFactory (or use an existing one, such as the MongoDB implementation) that can be scaled horizontally. Then you should be able to have multiple crawlers sharing the same config. You may hit a few concurrency issues, but hopefully those will not be of concern.
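To illustrate the concurrency issue a shared crawl store has to handle, here is a minimal, self-contained Java sketch. The class and method names (`ClaimingCrawlStore`, `tryClaim`) are illustrative only, not Norconex APIs: the point is that the store must let multiple crawler instances "claim" a URL atomically, so no two instances fetch the same page. In a real distributed setup the in-memory map would be a shared store such as a MongoDB collection with a unique index on the URL.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not a Norconex class: models the atomic
// check-and-set a shared crawl store needs so concurrent crawler
// instances never process the same URL twice.
public class ClaimingCrawlStore {

    private final Map<String, String> claims = new ConcurrentHashMap<>();

    /** Returns true only for the first instance that claims the URL. */
    public boolean tryClaim(String url, String instanceId) {
        // putIfAbsent is atomic: exactly one caller sees null (no prior claim).
        return claims.putIfAbsent(url, instanceId) == null;
    }

    public static void main(String[] args) {
        ClaimingCrawlStore store = new ClaimingCrawlStore();
        List<String> frontier = List.of(
                "https://example.com/a", "https://example.com/b");
        // Two instances race over the same frontier; each URL is
        // claimed (and would be fetched) exactly once.
        for (String url : frontier) {
            boolean byOne = store.tryClaim(url, "crawler-1");
            boolean byTwo = store.tryClaim(url, "crawler-2");
            System.out.println(url + " claimed-by-1=" + byOne
                    + " claimed-by-2=" + byTwo);
        }
    }
}
```

With a plain shared store that lacks this atomic claim step, two instances reading the queue at the same moment would both fetch the same URL, which is the main pitfall of naively sharing one config.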

Dynamic config generation. I have also seen people generate their configuration dynamically, distributing the sites to crawl across multiple collectors/crawlers.
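One simple way to do that partitioning is to hash each start URL's host into one of N disjoint shards and generate one collector config per shard. The sketch below shows only the partitioning idea; the `SiteSharder` class and method names are hypothetical, and emitting the actual per-shard Norconex config files is left out.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: deterministically split start URLs into N
// disjoint shards by host, so each collector instance gets its own
// non-overlapping set of sites to crawl.
public class SiteSharder {

    /** Assigns a host to one of shardCount shards, deterministically. */
    public static int shardFor(String host, int shardCount) {
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(host.hashCode(), shardCount);
    }

    /** Groups URLs by shard; same host always lands in the same shard. */
    public static Map<Integer, List<String>> partition(
            List<String> urls, int shardCount) {
        return urls.stream().collect(Collectors.groupingBy(
                u -> shardFor(URI.create(u).getHost(), shardCount)));
    }

    public static void main(String[] args) {
        List<String> sites = List.of(
                "https://site-a.example/", "https://site-b.example/",
                "https://site-c.example/", "https://site-a.example/news");
        // Each shard's URL list would seed one collector's <startURLs>.
        partition(sites, 2).forEach((shard, urls) ->
                System.out.println("shard " + shard + " -> " + urls));
    }
}
```

Because shard assignment depends only on the host, all URLs from one site stay with one collector, so the instances never crawl the same pages twice, at the cost of less even load balancing than a shared queue would give.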

If you find other approaches, I invite you to share them.

The next major release should make it easier to perform horizontal scaling.

essiembre commented 5 years ago

I had closed this issue by accident. Re-opening.

jetnet commented 5 years ago

It's an interesting topic, so I'd like to participate. :) Regarding the distributed crawl store:

> Then you should be able to have multiple crawlers sharing the same config.

If I'm not mistaken, a distributed store (like MongoDB or NFS) would not by itself make crawling efficient when all the collectors share the same config: the same sites could potentially be crawled as many times as you have collector instances.

> The next major release should make it easier to perform horizontal scaling.

could you please share some details on that? Thanks!

essiembre commented 5 years ago

Hello @jetnet. Nothing is cast in stone for V3 yet and we welcome suggestions. One thing is for sure: we would like to make it easier for developers to integrate it into other systems that can scale horizontally. "Vertical" crawling and simplicity will remain a key focus of version 3, but we are considering a few ideas. For example, having each important component (collector, importer, committer) runnable on its own as a server, to facilitate reuse/scalability.

The crawl store will be reworked as well, but in the end it probably won't be used the way you describe. It can be distributed, but for a single large crawl "session". Different crawler configurations/sessions should probably stick to their own crawl store "space" (distributed or not) for better results. Implementors may have different crawlers crawling some of the same content but extracting/filtering it differently for whatever reason, so those crawlers should not overlap.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.