commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0
321 stars, 35 forks

Avoid following advertisements in news feeds and sitemaps #58

Open: sebastian-nagel opened this issue 11 months ago

sebastian-nagel commented 11 months ago

See also this discussion on Common Crawl's user group.

Some news sites sell slots in their news feeds and sitemaps and fill them with advertisements. The crawler follows these ad links the same way it follows links to news articles. Because news sitemaps are auto-detected, the crawler may then also discover a sitemap on the advertised target site and crawl thousands of "news" articles from it.

Potential ways to fight these ads: