commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0
Adaptive fetch schedule for feeds #15

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

At present, the refetch interval for seed feeds is set globally to 3 hours, which is a compromise between

The schedule should adapt to each feed's change frequency within a configurable minimum and maximum refetch interval (e.g., 10 minutes to 2 weeks). Detection of unchanged feeds should not depend on a Last-Modified time sent with the server response.
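A minimal sketch of the idea, not the scheduler that was actually deployed: the interval grows when a feed is unchanged and shrinks when it changes, clamped to the configured bounds. Change detection uses a hash of the fetched body rather than any Last-Modified header. The names `FeedState`, `next_interval`, and the two rate constants are hypothetical and chosen for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Bounds taken from the example range in the issue text.
MIN_INTERVAL_MIN = 10             # 10 minutes
MAX_INTERVAL_MIN = 14 * 24 * 60   # 2 weeks
DEFAULT_INTERVAL_MIN = 180        # the current global 3-hour schedule

# Hypothetical tuning knobs (assumptions, not values from the project).
INCREASE_RATE = 0.5  # grow the interval by 50% when the feed is unchanged
DECREASE_RATE = 0.5  # shrink the interval by 50% when the feed has changed


@dataclass
class FeedState:
    interval_min: float = DEFAULT_INTERVAL_MIN
    signature: Optional[str] = None  # hash of the last fetched body


def content_signature(body: bytes) -> str:
    """Hash the feed body so change detection ignores response headers."""
    return hashlib.sha256(body).hexdigest()


def next_interval(state: FeedState, body: bytes) -> FeedState:
    """Return the updated state after fetching `body` for this feed."""
    sig = content_signature(body)
    if state.signature is None:
        # First fetch: nothing to compare against, keep the interval.
        interval = state.interval_min
    elif sig == state.signature:
        # Unchanged: back off.
        interval = state.interval_min * (1 + INCREASE_RATE)
    else:
        # Changed: fetch more often.
        interval = state.interval_min * (1 - DECREASE_RATE)
    # Clamp to the configured bounds.
    interval = max(MIN_INTERVAL_MIN, min(MAX_INTERVAL_MIN, interval))
    return FeedState(interval_min=interval, signature=sig)
```

Hashing the raw body treats any byte-level difference as a change. A real feed may need normalisation first (for example, ignoring a build timestamp element), which this sketch does not attempt.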

sebastian-nagel commented 7 years ago

The adaptive scheduler has been in production for 3 weeks:

sebastian-nagel commented 7 years ago

Testing in production is ongoing; waiting for DigitalPebble/storm-crawler#418.