commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Avoid following advertisements in news feeds and sitemaps #58

Open sebastian-nagel opened 9 months ago

sebastian-nagel commented 9 months ago

See also this discussion on Common Crawl's user group.

Some news sites sell slots in their news feeds and sitemaps and place advertisements there. The crawler follows these links the same way it follows links to news articles. Because the crawler auto-detects news sitemaps, it may then crawl thousands of "news" articles from the site the advertisement links to.

Potential ways to fight these ads: