Some news sites sell slots in their news feeds and sitemaps and place advertisements there. The crawler follows these links the same way it follows links to news articles. Because news sitemaps are detected automatically, thousands of "news" articles from the target site (the site the advertisement links to) may then be crawled.
Potential ways to fight these ads:
block following cross-site links, i.e. implement cross-submission validation, see #32
disable sitemap auto-detection (of course, sitemap seeds may then be lost if the URL changes)
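The first option could look roughly like the following sketch. It is not the crawler's actual implementation: the function names `same_site` and `filter_links` are hypothetical, and the "same site" rule (host names equal after dropping a leading `www.`) is a simplifying assumption. A real check would probably compare registered domains and honor cross-submission declared in robots.txt.

```python
from urllib.parse import urlsplit


def _host(url):
    """Return the lower-cased host of a URL without a leading 'www.', or ''."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def same_site(source_url, link_url):
    """Assumed rule: a link is on the same site if both hosts are equal."""
    source_host = _host(source_url)
    link_host = _host(link_url)
    return bool(source_host) and source_host == link_host


def filter_links(source_url, links):
    """Keep only links that stay on the site of the feed or sitemap."""
    return [link for link in links if same_site(source_url, link)]


if __name__ == "__main__":
    sitemap = "https://www.example.com/sitemap-news.xml"
    found = [
        "https://example.com/politics/article-1.html",
        "https://ads.example.net/sponsored/offer.html",
    ]
    # Only the link on example.com is kept; the sponsored link is dropped.
    print(filter_links(sitemap, found))
```

An exact host comparison would also drop legitimate links to sibling subdomains of the same publisher, so the rule would need tuning against real feeds.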
See also this discussion on Common Crawl's user group.