Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Question on distributed crawl #345

Closed · danizen closed this issue 7 years ago

danizen commented 7 years ago

I do not think my use-case will require a distributed crawl, but I still want to ask whether you have ever implemented one, and if not, whether you have given thought to how to implement it if this is ever needed.

Thanks.

BTW - I've started work on a kafka-committer. I'll probably have a test that runs with vagrant so that I know it truly works.

essiembre commented 7 years ago

Distributed crawling has not come up as a requirement for HTTP Collector yet. "Vertical" crawling has been the main focus so far. Setting up multiple instances, each crawling within its own boundaries, has been sufficient to address the larger use cases I have faced. On an average machine, you should be able to crawl a few million documents per instance without problems. There has not been serious demand for distribution other than here: #92.

If there is enough demand we might consider it. The thoughts so far have been to allow some components to optionally run as standalone services (crawling, crawl data storage, importing, committing, etc.), and to integrate/distribute these components in the most efficient way into an existing distribution framework.

Let us know when you feel your Kafka Committer is stable enough and I will link to it from the Collectors site.

danizen commented 7 years ago

Good to see what someone else is looking for. My KafkaCommitter is truly something I'm working on after work, and it is a long way from being stable.

My main use case does involve some long-running enrichment tasks, and a KafkaCommitter makes sense as one of many ways to address that.

Maybe you can advise me on best practices with the actual problem ;) For each crawled site, I need to do some enrichment by calling MTI's Web API BatchAccess, described at https://ii.nlm.nih.gov/Batch/index.shtml. MTI is an NLM-specific classifier/topic extractor that classifies text into NLM's medical subject taxonomy, MeSH (Medical Subject Headings). It may take 3 seconds for an individual text, and 6 seconds for 50 such documents. So, accessing this as a "tagger" one document at a time will not work :)

Another way would be to commit to ElasticSearch along with a constant status. Each enrichment task queries ElasticSearch for the oldest N documents with a particular status, and then re-indexes them after performing the enrichment. Merging takes a while, but that could be the cost of doing it that way.
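
Roughly what I have in mind for the polling side, sketched with the Elasticsearch high-level REST Java client. Nothing here is final: the index name (mycrawl) and the fields (enrich_status, crawl_date) are made up for the example, and the MTI call and re-indexing are only described in comments.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

public class EnrichmentPoller {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Oldest 50 documents still waiting for enrichment.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("enrich_status", "pending"))
                    .sort("crawl_date", SortOrder.ASC)
                    .size(50);
            SearchResponse response = client.search(
                    new SearchRequest("mycrawl").source(source),
                    RequestOptions.DEFAULT);

            // For each hit: call MTI in one batch, then re-index the document
            // with the MeSH terms added and enrich_status flipped to "done".
            for (SearchHit hit : response.getHits()) {
                System.out.println("would enrich: " + hit.getId());
            }
        }
    }
}
```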

Your comments on #92 show the strength of Norconex. I like that Norconex is "crawl as code", but in comparison with tools such as Scrapy and Nutch, the configuration is limited to an XML file and properties. That's manageable.

essiembre commented 7 years ago

Interesting use case. My first instinct would be to extend the Elasticsearch Committer's commitBatch method. In it, you merge the docs into a single call to your service, then parse the response and add the decorations to the list of ICommitOperation objects. Then you call super.
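
Something along these lines. A rough sketch only: I am assuming the 2.x committer packages and that commitBatch receives a List<ICommitOperation>; exact names may differ, and callMtiBatch is a placeholder for your own HTTP client code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.norconex.committer.core.IAddOperation;
import com.norconex.committer.core.ICommitOperation;
import com.norconex.committer.elasticsearch.ElasticsearchCommitter;

public class MtiEnrichingElasticsearchCommitter extends ElasticsearchCommitter {

    @Override
    protected void commitBatch(List<ICommitOperation> batch) {
        // Gather the "add" operations; deletions need no enrichment.
        List<IAddOperation> additions = new ArrayList<>();
        for (ICommitOperation op : batch) {
            if (op instanceof IAddOperation) {
                additions.add((IAddOperation) op);
            }
        }

        // One call to MTI's batch service covering the whole batch
        // (callMtiBatch is a placeholder for your own client code).
        Map<String, List<String>> meshByReference = callMtiBatch(additions);

        // Decorate each document's metadata with the returned MeSH terms.
        for (IAddOperation add : additions) {
            List<String> terms = meshByReference.get(add.getReference());
            if (terms != null && !terms.isEmpty()) {
                // Exact metadata method name may differ (addString/setString).
                add.getMetadata().addString("mesh", terms.toArray(new String[0]));
            }
        }

        // Hand the decorated batch to the regular Elasticsearch commit logic.
        super.commitBatch(batch);
    }

    private Map<String, List<String>> callMtiBatch(List<IAddOperation> additions) {
        // Placeholder: send the documents' text to MTI's Batch API in one call
        // and return the MeSH terms keyed by document reference.
        return new HashMap<>();
    }
}
```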

The committer already queues for you and allows you to control the batch size.

If you want to make it more generic, it could be a decorating class instead that takes another committer (so you are not bound to a specific one).
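
For instance, a decorating committer could look something like this. Again just a sketch, assuming the Committer Core 2.x ICommitter interface (add/remove/commit); the enrichment call itself is left as a comment placeholder.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import com.norconex.committer.core.ICommitter;
import com.norconex.commons.lang.map.Properties;

public class EnrichingCommitter implements ICommitter {

    private final ICommitter wrapped;   // e.g. Elasticsearch, Solr, Kafka...
    private final int batchSize;
    private final List<QueuedAdd> pending = new ArrayList<>();

    public EnrichingCommitter(ICommitter wrapped, int batchSize) {
        this.wrapped = wrapped;
        this.batchSize = batchSize;
    }

    @Override
    public synchronized void add(
            String reference, InputStream content, Properties metadata) {
        // A real implementation would buffer the content stream, since it may
        // no longer be readable by the time the batch is flushed.
        pending.add(new QueuedAdd(reference, content, metadata));
        if (pending.size() >= batchSize) {
            flushPending();
        }
    }

    @Override
    public void remove(String reference, Properties metadata) {
        wrapped.remove(reference, metadata); // deletions need no enrichment
    }

    @Override
    public synchronized void commit() {
        flushPending();
        wrapped.commit();
    }

    private void flushPending() {
        // Placeholder: one MTI call for all pending documents, adding the
        // returned MeSH terms to each document's metadata before forwarding.
        for (QueuedAdd add : pending) {
            wrapped.add(add.reference, add.content, add.metadata);
        }
        pending.clear();
    }

    private static final class QueuedAdd {
        private final String reference;
        private final InputStream content;
        private final Properties metadata;

        QueuedAdd(String reference, InputStream content, Properties metadata) {
            this.reference = reference;
            this.content = content;
            this.metadata = metadata;
        }
    }
}
```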

Could something like that work for you?

danizen commented 7 years ago

Could something like that work for you?

Maybe. Does the crawl continue in other threads while the committer is operating?

essiembre commented 7 years ago

Yes. If you do not want that, you can synchronize the commitBatch method.
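
For example, something as simple as declaring the override synchronized (same assumed 2.x signature as in the earlier sketch) would force batches to be committed one at a time while crawler threads keep queuing:

```java
import java.util.List;

import com.norconex.committer.core.ICommitOperation;
import com.norconex.committer.elasticsearch.ElasticsearchCommitter;

public class SingleBatchElasticsearchCommitter extends ElasticsearchCommitter {

    @Override
    protected synchronized void commitBatch(List<ICommitOperation> batch) {
        // Only one batch is processed at a time; crawler threads keep queuing.
        super.commitBatch(batch);
    }
}
```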

danizen commented 7 years ago

So, then that could work.

What would happen if there were 100 documents queued, so a commit batch started, and while it was ongoing, 100 more documents arrived? Would that be a second thread, or would the queue size just get larger? If the latter, what would happen when the queue size exceeded the "optional maximum queueSize" that exists for both the Solr and ES committers?

essiembre commented 7 years ago

When committing, only the X oldest entries in the queue are grabbed by the committing thread (up to the configured maximum). Other threads are free to keep queuing entries, or even to send other batches.

The only part that is synchronized is where the X oldest entries for a batch are identified and reserved, to prevent two threads from identifying the same docs and committing them twice. So you should be safe. :-)

danizen commented 7 years ago

This will work for me, thanks.