commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Ensure that seeds are refetched from time to time even if failed or redirected #14

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

All manually collected seeds (feeds or future seed formats, e.g., news sitemaps) should be refetched from time to time (weekly or monthly), even if they were redirected or failed to fetch.

Ideally, the fetch schedule should be configurable for a combination of metadata and fetch status.
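
To make the idea concrete, here is a minimal sketch of what such a rule-based schedule could look like. It is only an illustration under assumptions: the class `SeedRefetchSchedule`, the `FetchStatus` enum, the `isSeed` metadata key, and the chosen intervals are all hypothetical and are not StormCrawler's actual scheduler API or configuration format.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: picks a refetch interval from (metadata, fetch status). */
class SeedRefetchSchedule {

    /** Illustrative fetch outcomes; not taken from any library. */
    enum FetchStatus { FETCHED, REDIRECTION, FETCH_ERROR, ERROR }

    /** One rule: a metadata key/value pair plus a status maps to an interval. */
    private static final class Rule {
        final String key;
        final String value;
        final FetchStatus status;
        final Duration interval;

        Rule(String key, String value, FetchStatus status, Duration interval) {
            this.key = key;
            this.value = value;
            this.status = status;
            this.interval = interval;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Duration defaultInterval;

    SeedRefetchSchedule(Duration defaultInterval) {
        this.defaultInterval = defaultInterval;
    }

    /** Registers a rule; rules are checked in insertion order. */
    SeedRefetchSchedule addRule(String key, String value,
                                FetchStatus status, Duration interval) {
        rules.add(new Rule(key, value, status, interval));
        return this;
    }

    /** Returns the interval of the first matching rule, else the default. */
    Duration intervalFor(Map<String, String> metadata, FetchStatus status) {
        for (Rule r : rules) {
            // A missing metadata key yields null, which never equals r.value.
            if (r.status == status && r.value.equals(metadata.get(r.key))) {
                return r.interval;
            }
        }
        return defaultInterval;
    }
}
```

With rules such as `("isSeed", "true", REDIRECTION) -> 7 days` and `("isSeed", "true", FETCH_ERROR) -> 30 days`, a seed would still be rescheduled after a redirect or a failed fetch, while URLs without the seed marker fall back to the default interval.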

sebastian-nagel commented 7 years ago

Testing DigitalPebble/storm-crawler#386 in production; waiting for DigitalPebble/storm-crawler#420.