commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Should follow subsitemaps in sitemap index #22

Closed sebastian-nagel closed 6 years ago

sebastian-nagel commented 6 years ago

The news crawler uses only news sitemaps as a "news feed" and ignores "ordinary" sitemaps, i.e. it does not follow the URLs listed in them. However, the crawler should follow the sitemaps listed in a sitemap index and check whether any of them is a news sitemap. For example, https://www.greenwichtime.com/sitemap_news.xml is not itself a news sitemap, but it is a sitemap index that links to a number of news sitemaps:

<?xml version="1.0" encoding="UTF-8" ?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>http://www.greenwichtime.com/sitemap/news/ap.xml</loc>
    <lastmod>2018-02-08T03:15:03Z</lastmod>
</sitemap>
...
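To make the requested behaviour concrete, here is a minimal sketch of the two checks involved: listing the subsitemaps of a sitemap index, and testing whether a fetched sitemap is a news sitemap. It is not the news-crawl or StormCrawler implementation. The function names `subsitemap_urls` and `is_news_sitemap` and the `NEWS` sample document (with its example.com URL) are made up for illustration, and the check assumes a news sitemap can be recognised by elements in the Google News sitemap namespace.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: a news sitemap is identified by <news:news> elements in this namespace.
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def subsitemap_urls(xml_text):
    """Return the <loc> URLs of a sitemap index, or [] if this is not an index."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{SITEMAP_NS}}}sitemapindex":
        return []
    locs = root.findall(f"{{{SITEMAP_NS}}}sitemap/{{{SITEMAP_NS}}}loc")
    return [loc.text.strip() for loc in locs if loc.text]


def is_news_sitemap(xml_text):
    """Return True if the sitemap contains at least one <news:news> element."""
    root = ET.fromstring(xml_text)
    return root.find(f".//{{{NEWS_NS}}}news") is not None


# Sample index modelled on the excerpt above (only the first entry).
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>http://www.greenwichtime.com/sitemap/news/ap.xml</loc>
    <lastmod>2018-02-08T03:15:03Z</lastmod>
</sitemap>
</sitemapindex>"""

# Hypothetical news sitemap; the URL and title are invented for this sketch.
NEWS = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
    <loc>http://www.example.com/article.html</loc>
    <news:news><news:title>Example</news:title></news:news>
</url>
</urlset>"""

if __name__ == "__main__":
    # A crawler would fetch each subsitemap and keep those that pass the news check.
    for url in subsitemap_urls(INDEX):
        print("candidate subsitemap:", url)
    print("index is a news sitemap:", is_news_sitemap(INDEX))
    print("sample is a news sitemap:", is_news_sitemap(NEWS))
```

In a crawler, the URLs returned by `subsitemap_urls` would be queued for fetching, and only the subsitemaps that pass `is_news_sitemap` would be treated as news feeds.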