commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

Automatic removal of ephemeral sitemaps #40

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

If a news site creates sitemaps with unique URLs on a daily basis (or at even shorter intervals), over time this leads to too many sitemaps being checked for updates, so that news articles get stuck in queues jammed with sitemaps. The unique sitemap URLs can stem from the robots.txt or from a sitemap index. Typical URL/file patterns of ephemeral sitemaps include a date, timestamp, or similar unique identifier embedded in the sitemap URL.
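As a rough illustration (not StormCrawler code — the class name and regexes below are hypothetical assumptions), such patterns could be flagged with a couple of regular expressions:

```java
import java.util.regex.Pattern;

/**
 * Heuristic check for ephemeral (date- or timestamp-stamped) sitemap URLs.
 * The patterns are illustrative assumptions, not an exhaustive list.
 */
public class EphemeralSitemapDetector {

    // Matches dates such as .../sitemap-2020-05-17.xml or .../sitemap_20200517.xml
    private static final Pattern DATE_PATTERN = Pattern.compile(
            "(19|20)\\d{2}[-_/]?(0[1-9]|1[0-2])[-_/]?(0[1-9]|[12]\\d|3[01])");

    // Matches long numeric identifiers (e.g. epoch timestamps) in the URL
    private static final Pattern TIMESTAMP_PATTERN = Pattern.compile("\\d{9,}");

    public static boolean looksEphemeral(String url) {
        return DATE_PATTERN.matcher(url).find()
                || TIMESTAMP_PATTERN.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(looksEphemeral("https://example.com/sitemap-2020-05-17.xml")); // true
        System.out.println(looksEphemeral("https://example.com/sitemap-news.xml"));       // false
    }
}
```

A pattern-based check like this would always be heuristic; whitelisting stable sitemaps per host may be needed to avoid false positives.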

In the worst case, 100k or even millions of sitemaps may be tracked for a single domain, which requires manually blocking or cleaning up the list of sitemaps in order to fetch news articles and follow the most recent sitemaps.
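One possible shape for the automatic removal this issue asks for is TTL-based expiry: a sitemap is dropped if it has not been re-discovered (via the robots.txt or a sitemap index) within some window. A minimal standalone sketch, assuming a 30-day TTL (the class and TTL value are hypothetical, not part of StormCrawler):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy model of automatic sitemap expiry: drop sitemaps that were
 * last listed (in robots.txt or a sitemap index) longer than TTL ago.
 */
public class SitemapTracker {

    // Assumed expiry window; a real value would need tuning per crawl.
    private static final Duration TTL = Duration.ofDays(30);

    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    /** Record that a sitemap URL was discovered or re-discovered. */
    public void touch(String sitemapUrl) {
        lastSeen.put(sitemapUrl, Instant.now());
    }

    /** Remove sitemaps not re-discovered within the TTL. */
    public void expire() {
        Instant cutoff = Instant.now().minus(TTL);
        lastSeen.entrySet().removeIf(e -> e.getValue().isBefore(cutoff));
    }

    public int size() {
        return lastSeen.size();
    }
}
```

Expiry alone would keep the tracked set bounded even without pattern detection, since ephemeral sitemaps by definition stop being listed after a short time.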