commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Consider archiving of news feeds and sitemaps #54

Open sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

The news feeds and sitemaps can be useful by themselves, the feeds more so than the sitemaps because they include news titles and short snippets. It might make sense to store them in the WARC files as well.

But first, it's important to understand what the storage footprint would be, since feeds and sitemaps are refetched multiple times per day.
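
A rough back-of-the-envelope sketch of how that footprint could be estimated. The function name, its parameters, and the numbers in the usage example are hypothetical placeholders, not measurements from the crawler. It also ignores per-record WARC header overhead and assumes every refetch is stored in full.

```python
def estimate_daily_footprint_bytes(num_feeds, avg_size_bytes,
                                   fetches_per_day, compression_ratio=1.0):
    """Estimate bytes stored per day if every refetch of every
    feed/sitemap is archived.

    All inputs are assumptions supplied by the caller:
      num_feeds          -- number of distinct feeds/sitemaps
      avg_size_bytes     -- average uncompressed response size
      fetches_per_day    -- average refetches per feed per day
      compression_ratio  -- uncompressed size / compressed size
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    uncompressed = num_feeds * avg_size_bytes * fetches_per_day
    return uncompressed / compression_ratio


if __name__ == "__main__":
    # Placeholder inputs purely for illustration.
    daily = estimate_daily_footprint_bytes(
        num_feeds=1000, avg_size_bytes=50_000,
        fetches_per_day=10, compression_ratio=5.0)
    print(f"{daily / 1e6:.1f} MB per day")
```

Plugging in real values (feed count, average response size, and refetch rate taken from the crawl status) would show whether the extra records are negligible next to the article content or not.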