commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Check cross-submits for sitemaps #32

Open sebastian-nagel opened 5 years ago

sebastian-nagel commented 5 years ago

Sitemaps are automatically detected in the robots.txt but are not checked for cross-submits. From time to time this leads to spam-like injections of URLs that do not match the news genre. Recently, a publishing company used the robots.txt of one of its periodicals to "inject" its entire publishing program, including landing pages for books and other media. The same has happened before with real estate ads. Note that the sitemaps must follow the news sitemap format, which is a barrier to most cross-submits, but not always.
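
A minimal sketch of what such a check could look like (not the project's actual implementation): compare the host of the sitemap URL against the host of the robots.txt it was announced in. A production check would compare registered domains (eTLD+1), e.g. via crawler-commons or Guava, rather than exact hosts; the example URLs below are made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CrossSubmitCheck {

    /** Returns true if the sitemap is hosted on a different host than the robots.txt. */
    public static boolean isCrossSubmit(String robotsTxtUrl, String sitemapUrl) {
        try {
            String robotsHost = new URI(robotsTxtUrl).getHost();
            String sitemapHost = new URI(sitemapUrl).getHost();
            if (robotsHost == null || sitemapHost == null) {
                // unparsable URLs are conservatively treated as cross-submits
                return true;
            }
            // exact-host comparison; a real check should compare registered domains
            return !robotsHost.equalsIgnoreCase(sitemapHost);
        } catch (URISyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isCrossSubmit(
            "https://news.example.com/robots.txt",
            "https://books.example-publisher.com/sitemap-news.xml")); // true -> verify or skip
        System.out.println(isCrossSubmit(
            "https://news.example.com/robots.txt",
            "https://news.example.com/sitemap-news.xml"));            // false -> accept
    }
}
```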

sebastian-nagel commented 5 years ago

A further scenario: a news site redirects one of its news articles to a page on another site, as a kind of advertisement. We need to check the robots.txt of the target site, of course, but we should ignore its sitemap directives.
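
A rough sketch of that policy, assuming crawler-commons' SimpleRobotRulesParser (which StormCrawler builds on); the robots.txt URL and the agent name are illustrative placeholders, and method signatures may differ between crawler-commons versions:

```java
import java.util.Collections;
import java.util.List;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class OffsiteRedirectPolicy {

    /** Obey the target site's robots.txt when deciding whether to follow the redirect. */
    public static boolean mayFollowRedirect(String targetUrl, byte[] targetRobotsTxt) {
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://target.example.com/robots.txt", // robots.txt of the redirect target
                targetRobotsTxt, "text/plain", "ccbot");
        return rules.isAllowed(targetUrl);
    }

    /** But do not pick up sitemaps announced in the target site's robots.txt. */
    public static List<String> sitemapsToEnqueue(BaseRobotRules targetRules) {
        // targetRules.getSitemaps() is deliberately ignored for off-site redirects
        return Collections.emptyList();
    }
}
```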