apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Fix the logic around sitemap = false #710

Closed · jnioche closed 5 years ago

jnioche commented 5 years ago

#645 was a good idea in theory but needs fixing. The idea was to prevent pages from having their outlinks followed unless they had been flagged as being a sitemap (or not): basically, since we have sitemaps, let's stick to what they contain.

For a given URL, this was done in the fetchers by setting isSitemap=false when sitemap files were listed in the robots.txt, and only if the robots.txt had been freshly fetched rather than served from the cache.
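A minimal sketch of that behaviour, with assumed method and variable names (this is not the actual FetcherBolt code, just an illustration using StormCrawler's Metadata class):

```java
import com.digitalpebble.stormcrawler.Metadata;

import java.util.List;

public class SitemapFlagging {

    /**
     * Sketch: when a freshly fetched robots.txt lists sitemaps,
     * default isSitemap to "false" so that the MetadataFilter can
     * later drop the outlinks of ordinary pages on that site.
     * Cached robots are skipped, matching the description above.
     */
    public static void flagIfSitemapsKnown(Metadata md,
            List<String> sitemapsFromRobots, boolean robotsWasCached) {
        if (robotsWasCached || sitemapsFromRobots.isEmpty()) {
            return;
        }
        // don't overwrite an isSitemap value set earlier, e.g. at injection
        if (md.getFirstValue("isSitemap") == null) {
            md.setValue("isSitemap", "false");
        }
    }
}
```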

The outlinks are then filtered thanks to:

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
  "name": "MetadataFilter",
  "params": {
    "isSitemap": "false",
    "isFeed": "false"
  }
}
```
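For reference, a rough sketch of what this filter does, with simplified names (the real MetadataFilter is a URL filter configured from the JSON above): an outlink is discarded when the source document's metadata matches any configured key/value pair.

```java
import com.digitalpebble.stormcrawler.Metadata;

import java.util.Map;

public class MetadataFilterSketch {

    /**
     * Returns false (i.e. discard the outlink) when the source
     * document's metadata matches one of the configured pairs,
     * e.g. isSitemap=false or isFeed=false.
     */
    public static boolean keepOutlinks(Metadata sourceMd,
            Map<String, String> params) {
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getValue().equals(sourceMd.getFirstValue(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```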

This works OK in most cases, and nothing is affected if the MetadataFilter is not set, apart from the following case: if a sitemap is redirected, the isSitemap key is lost, the idea being that since we are not sure the target really is a sitemap and not an HTML page, the key is not transferred automatically. When the redirected URL is about to be fetched, and as we know there are sitemaps for that site, it gets isSitemap=false, which prevents the sitemap detection from being applied. The sitemap parsing is skipped and the document ends up being passed to the other parsers.
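To illustrate why the flag disappears on a redirection, a sketch of the transfer step, assuming a whitelist-based copy of metadata keys to the redirection target (names and the seed.category key are placeholders, not the actual implementation):

```java
import com.digitalpebble.stormcrawler.Metadata;

import java.util.Set;

public class RedirectMetadata {

    // placeholder whitelist: isSitemap is deliberately not part of it,
    // since the redirection target might be a plain HTML page
    static final Set<String> TRANSFERABLE = Set.of("seed.category");

    public static Metadata forRedirectTarget(Metadata source) {
        Metadata target = new Metadata();
        for (String key : TRANSFERABLE) {
            String value = source.getFirstValue(key);
            if (value != null) {
                target.setValue(key, value);
            }
        }
        // isSitemap is gone; the fetcher will then re-set it to "false"
        return target;
    }
}
```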

The detection is triggered by the config sitemap.sniffContent: true but is done anyway in order to fix incorrect MIME types.
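As an illustration, the sniffing boils down to something like the following (simplified and with assumed names, not the actual parser code):

```java
import java.nio.charset.StandardCharsets;

public class SitemapSniffer {

    /**
     * Looks at the first bytes of the fetched content to decide
     * whether it is a sitemap, regardless of the Content-Type
     * returned by the server.
     */
    public static boolean looksLikeSitemap(byte[] content) {
        int len = Math.min(content.length, 512);
        String head = new String(content, 0, len, StandardCharsets.UTF_8);
        return head.contains("<urlset") || head.contains("<sitemapindex");
    }
}
```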

Here is a cleaner approach:

jnioche commented 5 years ago

The sitemap parser will mark the docs as isSitemap=false, but that won't be used to filter the outlinks. Instead we'll need a better version of the metadata filter which can combine metadata, i.e. filter only if a doc has both foundSitemap=true and isSitemap=false.
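Something along these lines, as a sketch of the combined condition (class and method names are assumed; the real URL filter interface takes more arguments):

```java
import com.digitalpebble.stormcrawler.Metadata;

public class CombinedMetadataFilter {

    /**
     * Returns null to discard the outlink, or the URL itself to keep
     * it. The outlinks are dropped only when BOTH conditions hold:
     * a sitemap was announced for the site (foundSitemap=true) AND
     * the current document is not one (isSitemap=false).
     */
    public String filter(Metadata sourceMd, String urlToFilter) {
        boolean foundSitemap = "true"
                .equals(sourceMd.getFirstValue("foundSitemap"));
        boolean notASitemap = "false"
                .equals(sourceMd.getFirstValue("isSitemap"));
        if (foundSitemap && notASitemap) {
            return null;
        }
        return urlToFilter;
    }
}
```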