We've noticed that some `robots.txt` files don't reference any sitemaps, even though sitemaps exist on the website.
I propose an additional check in the `support_site` method of the sitemap crawlers (`SitemapCrawler` and `RecursiveSitemapCrawler`) that loops over a set of commonly used sitemap locations to check for articles.
A new configuration parameter, `sitemap_patterns`, allows pinging a set of sitemaps in addition to the `robots.txt` existence check.
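To illustrate the idea, here is a minimal sketch of such a fallback check. The pattern list, the `candidate_sitemap_urls` helper, and the `has_known_sitemap` function are all hypothetical names for illustration; they are not the actual news-please implementation or its `sitemap_patterns` values.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

# Hypothetical defaults for the proposed sitemap_patterns option;
# the real configuration values may differ.
COMMON_SITEMAP_PATTERNS = [
    "sitemap.xml",
    "sitemap_index.xml",
    "sitemap-index.xml",
    "news-sitemap.xml",
]


def candidate_sitemap_urls(domain_url, patterns=None):
    """Build the full sitemap URLs to probe for a given domain."""
    patterns = patterns or COMMON_SITEMAP_PATTERNS
    base = domain_url.rstrip("/")
    return [base + "/" + pattern for pattern in patterns]


def has_known_sitemap(domain_url, patterns=None, timeout=5):
    """Return True if any commonly used sitemap path answers with HTTP 200."""
    for url in candidate_sitemap_urls(domain_url, patterns):
        try:
            # HEAD is enough: we only care whether the sitemap exists.
            with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (URLError, OSError):
            continue  # unreachable or missing; try the next pattern
    return False
```

In `support_site`, a check like this would run only when `robots.txt` references no sitemap, so sites that already declare their sitemaps are unaffected.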
## Changes

- Add a set of common sitemap locations that are checked during the `support_site` step.
- Rename and update the `get_sitemap_url` function to `get_robots_response`, since the request logic would otherwise be duplicated across two functions. It no longer raises an exception; it only returns a boolean.
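The renamed function's new contract could look like the following sketch. This is an illustrative signature, not the actual patch: it simply reports whether `robots.txt` is reachable instead of raising on failure.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError


def get_robots_response(domain_url, timeout=5):
    """Return True if <domain>/robots.txt answers with HTTP 200, else False.

    Unlike the previous get_sitemap_url, this sketch swallows network
    errors and returns a boolean rather than raising.
    """
    url = domain_url.rstrip("/") + "/robots.txt"
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        return False
```

Returning a boolean lets both the `robots.txt` check and the common-sitemap fallback share one request helper without try/except at every call site.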
Hey @fhamborg, I look forward to your feedback. If you'd prefer to put this logic behind a configuration option, let me know; otherwise, users may suddenly retrieve a lot of new articles.