fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Add an additional sitemap check in Sitemap crawlers #271

Closed Medno closed 3 days ago

Medno commented 1 week ago

Hey,

We've noticed that some robots.txt files don't reference any sitemaps, even though sitemaps exist on the website.

I propose an additional check in the support_site method of the sitemap crawlers (SitemapCrawler and RecursiveSitemapCrawler) that loops over a set of commonly used sitemap paths to check for articles.

A new parameter, sitemap_patterns, is available in the configuration file. It allows pinging a set of sitemaps in addition to the robots.txt existence check.
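To illustrate the idea, here is a minimal sketch of such a fallback check. It is not the code from this PR: the function names, the default pattern list, and the injected `url_exists` callable are all assumptions made for the example, and the real default values of sitemap_patterns are not given in this thread.

```python
from typing import Callable, Iterable, List
from urllib.parse import urljoin, urlsplit

# Hypothetical defaults for illustration only; the PR's actual
# `sitemap_patterns` values are not shown in this thread.
DEFAULT_SITEMAP_PATTERNS = ["sitemap.xml", "sitemap_index.xml", "sitemap-index.xml"]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared by `Sitemap:` lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def candidate_sitemap_urls(site_url: str, patterns: Iterable[str]) -> List[str]:
    """Build one candidate URL per pattern, resolved against the site root."""
    parts = urlsplit(site_url)
    root = f"{parts.scheme}://{parts.netloc}/"
    return [urljoin(root, pattern.lstrip("/")) for pattern in patterns]


def site_has_sitemap(
    site_url: str,
    robots_txt: str,
    url_exists: Callable[[str], bool],
    patterns: Iterable[str] = DEFAULT_SITEMAP_PATTERNS,
) -> bool:
    """Check robots.txt first, then fall back to probing common sitemap paths.

    `url_exists` is an assumed hook (for example, a HEAD request that
    returns True on HTTP 200) so the sketch stays network-free.
    """
    if sitemaps_from_robots(robots_txt):
        return True
    # any() stops at the first pattern that responds.
    return any(url_exists(url) for url in candidate_sitemap_urls(site_url, patterns))
```

Passing the existence check in as a callable keeps the fallback easy to test and lets the crawler decide how to ping each URL. Because probing stops at the first hit, the order of the patterns determines how many requests a site receives.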

Changes

Medno commented 1 week ago

Hey @fhamborg, I look forward to your feedback. If you want this logic to be enabled through a configuration option, let me know. Otherwise, users may retrieve a lot of new articles.

fhamborg commented 3 days ago

Thanks! That's cool!

Medno commented 3 days ago

Thank you!