Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Dynamic committer routing based on language #299

Closed. zdrd closed this issue 8 years ago.

zdrd commented 8 years ago

I'm trying to figure out a way to route a crawled document to a specific Elasticsearch index based on the language of the document. I am using the com.norconex.importer.handler.tagger.impl.LanguageTagger as a pre-parse tagger in my importer config, and I can easily store the language as a field in an Elasticsearch index. However, the structure I am trying to achieve is to have many Elasticsearch indices (index-en, index-fr, etc.), one index per language. So my question is: is there any way to dynamically route a document to its corresponding Elasticsearch index based on its language (as detected by Apache Tika, not by any meta tag within the document) using the current version's configuration options?

essiembre commented 8 years ago

Right now there is no way to do "dynamic routing" like you suggest. I am flagging this as a feature request. In the meantime, you can write your own committer (or extend an existing one) that performs that custom routing for you. Without writing code, the only way I can think of to achieve this through configuration alone would be to crawl your site once for every language you support, with one crawler defined per language, each with its own committer, and use importer filters to reject documents that are not in the proper language for each crawler. Not a pretty solution.
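To give you an idea, here is a rough, untested sketch of what such a routing committer could look like. It assumes the Committer Core ICommitter interface (add/remove/commit) and that LanguageTagger stores the detected language in the document.language metadata field; the class name and constructor are made up for illustration and are not something that ships with the product.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import com.norconex.commons.lang.map.Properties;
import com.norconex.committer.core.ICommitter;

// Wraps one pre-configured committer per language and routes each document
// to the committer matching its detected language.
public class LanguageRoutingCommitter implements ICommitter {

    // One committer per language code, e.g. "en" -> a committer for index-en.
    private final Map<String, ICommitter> committersByLanguage = new HashMap<>();
    // Used when no language was detected or no committer exists for it.
    private final ICommitter fallbackCommitter;

    public LanguageRoutingCommitter(
            Map<String, ICommitter> committersByLanguage,
            ICommitter fallbackCommitter) {
        this.committersByLanguage.putAll(committersByLanguage);
        this.fallbackCommitter = fallbackCommitter;
    }

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        // Route the addition based on the language detected by LanguageTagger.
        String language = metadata.getString("document.language");
        committersByLanguage
                .getOrDefault(language, fallbackCommitter)
                .add(reference, content, metadata);
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // Kept simple in this sketch: deletions go to the fallback committer.
        fallbackCommitter.remove(reference, metadata);
    }

    @Override
    public void commit() {
        // Flush every wrapped committer.
        for (ICommitter committer : committersByLanguage.values()) {
            committer.commit();
        }
        fallbackCommitter.commit();
    }
}
```

You would pre-configure one Elasticsearch committer per language (index-en, index-fr, ...) plus a fallback, and wire this class up yourself; making it configurable through the crawler XML is left out of the sketch.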

As a side-note, I would suggest you use the LanguageTagger as a post-parse tagger to make sure detection is performed on text only and not text + markup or binary files. You should get better results.

zdrd commented 8 years ago

Thanks for the reply and the tip; I will do that.

essiembre commented 8 years ago

It just occurred to me that there is a tricky situation you will probably have to account for with the approach you want to take: deleted documents.

When a URL gets re-crawled and no longer exists (e.g., a 404), that URL is sent for deletion. At that point the content has not been downloaded again, so language detection is not possible. That means you will not have the language to tell you which index to delete the document from.

The not-so-nice approach of having one crawler per language seems to be the only way to capture deletions properly.

I am un-marking this as a feature request unless we can think of a reasonable alternative. With a custom solution, you could probably just send the URL to be deleted to all the indices you have, but I am not sure we want to make this a default practice.
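For the record, that "send the delete everywhere" idea could look like the following drop-in replacement for the remove() method in the earlier LanguageRoutingCommitter sketch (again hypothetical and untested):

```java
@Override
public void remove(String reference, Properties metadata) {
    // The content is not re-downloaded for deletions, so the language (and
    // therefore the target index) is unknown. Broadcast the removal to every
    // language-specific committer; a delete for a document that is not in a
    // given index should simply have no effect there.
    for (ICommitter committer : committersByLanguage.values()) {
        committer.remove(reference, metadata);
    }
    fallbackCommitter.remove(reference, metadata);
}
```

Whether the extra delete requests are acceptable is for you to judge, which is why I would hesitate to make this the default behavior.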

Let me know what approach worked for you.

zdrd commented 8 years ago

As I started to think through how I would go about solving this with a custom committer, I also realized there would be several tricky edge cases I would have to account for. I then realized that having more indices is really no different internally (performance-wise) than simply having more Elasticsearch shards in a single index, so I decided to just use a field to store the language instead, since the benefit of creating a custom committer wouldn't really be worth the effort.

If you did still want to add this as a feature: regarding the situation you describe, where the collector does not know which index the soon-to-be-deleted document resides in, I also think this could only be solved by having the delete function search the entire Elasticsearch cluster (assuming the language-specific indices are all in the same cluster) rather than a specific index in that cluster. That is just as easy to do as an index-specific search, though with added latency of course. And this would really only be necessary when using the DELETE orphan strategy, right? Nonetheless, the feature is no longer required for what I am building, but I think it can certainly be accomplished in the event that you want to add it as a feature for others. Thanks for helping!

essiembre commented 8 years ago

I am glad you have found an approach that works for you. Having a solution that searches for the document first would prevent this from becoming a generic solution (one available to other committers as well). Since you are OK for now, I will close this ticket, and we will revisit it if this demand comes up more often. In the meantime, I added the idea to the Committer Core project's internal TODO.txt file so it is not completely forgotten.

Thanks for your input!