apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Fix the logic around sitemap = false #710

Closed · jnioche closed 5 years ago

jnioche commented 5 years ago

#645 was a good idea in theory but needs fixing. The idea was to prevent pages from having their outlinks followed unless they had been flagged as being a sitemap (or not): basically, since we have sitemaps, let's stick to what they contain.

For a given URL, this was done in the fetchers by setting isSitemap=false when sitemap files were listed in the robots.txt, and only if the robots.txt had been freshly fetched rather than served from the cache.
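A minimal sketch of that behaviour, with assumed method and variable names (this is not the actual FetcherBolt code, just an illustration using StormCrawler's Metadata class):

```java
import com.digitalpebble.stormcrawler.Metadata;

import java.util.List;

public class SitemapFlagging {

    /**
     * Sketch: when a freshly fetched robots.txt lists sitemaps,
     * default isSitemap to "false" so that the MetadataFilter can
     * later drop the outlinks of ordinary pages on that site.
     * Cached robots are skipped, matching the description above.
     */
    public static void flagIfSitemapsKnown(Metadata md,
            List<String> sitemapsFromRobots, boolean robotsWasCached) {
        if (robotsWasCached || sitemapsFromRobots.isEmpty()) {
            return;
        }
        // don't overwrite an isSitemap value set earlier, e.g. at injection
        if (md.getFirstValue("isSitemap") == null) {
            md.setValue("isSitemap", "false");
        }
    }
}
```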

The outlinks are then filtered thanks to:

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
  "name": "MetadataFilter",
  "params": {
    "isSitemap": "false",
    "isFeed": "false"
  }
}
```
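For reference, a rough sketch of what this filter does, with simplified names (the real MetadataFilter is a URL filter configured from the JSON above): an outlink is discarded when the source document's metadata matches any configured key/value pair.

```java
import com.digitalpebble.stormcrawler.Metadata;

import java.util.Map;

public class MetadataFilterSketch {

    /**
     * Returns false (i.e. discard the outlink) when the source
     * document's metadata matches one of the configured pairs,
     * e.g. isSitemap=false or isFeed=false.
     */
    public static boolean keepOutlinks(Metadata sourceMd,
            Map<String, String> params) {
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getValue().equals(sourceMd.getFirstValue(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```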

This works OK in most cases, and nothing is affected if the MetadataFilter is not set, apart from the following case: if a sitemap is redirected, the isSitemap key is lost, the idea being that since we are not sure the target really is a sitemap and not an HTML page, the key is not transferred automatically. When the redirected URL is about to be fetched, and as we know there are sitemaps for that site, it gets isSitemap=false, which prevents the sitemap detection from being applied. The sitemap parsing is skipped and the document ends up being passed to the other parsers.
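To illustrate why the flag disappears on a redirection, a sketch of the transfer step, assuming a whitelist-based copy of metadata keys to the redirection target (names and the seed.category key are placeholders, not the actual implementation):

```java
import com.digitalpebble.stormcrawler.Metadata;

import java.util.Set;

public class RedirectMetadata {

    // placeholder whitelist: isSitemap is deliberately not part of it,
    // since the redirection target might be a plain HTML page
    static final Set<String> TRANSFERABLE = Set.of("seed.category");

    public static Metadata forRedirectTarget(Metadata source) {
        Metadata target = new Metadata();
        for (String key : TRANSFERABLE) {
            String value = source.getFirstValue(key);
            if (value != null) {
                target.setValue(key, value);
            }
        }
        // isSitemap is gone; the fetcher will then re-set it to "false"
        return target;
    }
}
```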

The detection is triggered by the config sitemap.sniffContent: true but is done anyway in order to fix incorrect MIME types.
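As an illustration, the sniffing boils down to something like the following (simplified and with assumed names, not the actual parser code):

```java
import java.nio.charset.StandardCharsets;

public class SitemapSniffer {

    /**
     * Looks at the first bytes of the fetched content to decide
     * whether it is a sitemap, regardless of the Content-Type
     * returned by the server.
     */
    public static boolean looksLikeSitemap(byte[] content) {
        int len = Math.min(content.length, 512);
        String head = new String(content, 0, len, StandardCharsets.UTF_8);
        return head.contains("<urlset") || head.contains("<sitemapindex");
    }
}
```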

Here is a cleaner approach:

jnioche commented 5 years ago

The sitemap parser will mark the docs as isSitemap=false, but that won't be used to filter the outlinks. Instead we'll need a better version of the metadata filter which can combine metadata, i.e. filter only if a doc has both foundSitemap=true and isSitemap=false.
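Something along these lines, as a sketch of the combined condition (class and method names are assumed; the real URL filter interface takes more arguments):

```java
import com.digitalpebble.stormcrawler.Metadata;

public class CombinedMetadataFilter {

    /**
     * Returns null to discard the outlink, or the URL itself to keep
     * it. The outlinks are dropped only when BOTH conditions hold:
     * a sitemap was announced for the site (foundSitemap=true) AND
     * the current document is not one (isSitemap=false).
     */
    public String filter(Metadata sourceMd, String urlToFilter) {
        boolean foundSitemap = "true"
                .equals(sourceMd.getFirstValue("foundSitemap"));
        boolean notASitemap = "false"
                .equals(sourceMd.getFirstValue("isSitemap"));
        if (foundSitemap && notASitemap) {
            return null;
        }
        return urlToFilter;
    }
}
```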