commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

AdaptiveScheduler not applied to RSS/Atom feeds #19

Closed: sebastian-nagel closed this issue 6 years ago

sebastian-nagel commented 6 years ago

The fetchInterval in the metadata is not properly updated for RSS/Atom feeds. News sitemaps do not seem to be affected... (seen with the recent version based on StormCrawler 1.8 / Elasticsearch 6.0)

sebastian-nagel commented 6 years ago

Seems to be caused by the signature not being persisted. Without a stored signature (normally the case only for the first fetch after injection), AdaptiveScheduler calls DefaultScheduler instead of adapting the interval.
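
To illustrate the fallback described above, here is a minimal sketch in Python. It is a simplified model, not StormCrawler's actual code: the function name, the default interval, the bounds, and the increase/decrease rates are all assumptions made for illustration. Only the idea comes from this issue: with no stored signature there is nothing to compare against, so the default interval is used.

```python
# Simplified model of adaptive scheduling with a default fallback.
# All constants and names below are hypothetical, chosen for illustration.

DEFAULT_INTERVAL_MIN = 1440   # assumed default fetch interval (minutes)
MIN_INTERVAL_MIN = 60         # assumed lower bound
MAX_INTERVAL_MIN = 10080      # assumed upper bound
INCREASE_RATE = 0.5           # assumed growth when content is unchanged
DECREASE_RATE = 0.5           # assumed shrink when content has changed


def next_fetch_interval(metadata: dict, new_signature: str) -> int:
    """Return the next fetch interval in minutes for one URL."""
    old_signature = metadata.get("signature")
    if old_signature is None:
        # No signature was persisted: we cannot tell whether the content
        # changed, so fall back to the fixed default interval.
        return DEFAULT_INTERVAL_MIN

    interval = int(metadata.get("fetchInterval", DEFAULT_INTERVAL_MIN))
    if new_signature == old_signature:
        # Unchanged content: fetch less often, up to the upper bound.
        return min(MAX_INTERVAL_MIN, int(interval * (1 + INCREASE_RATE)))
    # Changed content: fetch more often, down to the lower bound.
    return max(MIN_INTERVAL_MIN, int(interval * (1 - DECREASE_RATE)))


def simulate(persist_signature: bool, signatures: list) -> list:
    """Run successive fetches and collect the interval chosen each time."""
    metadata, intervals = {}, []
    for sig in signatures:
        interval = next_fetch_interval(metadata, sig)
        intervals.append(interval)
        metadata["fetchInterval"] = interval
        if persist_signature:
            metadata["signature"] = sig
    return intervals
```

In this model, calling `simulate(False, ...)` returns the default interval for every fetch, which matches the symptom reported here for feeds. Once the signature is persisted, the interval starts to move after the first fetch.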

sebastian-nagel commented 6 years ago

Fixed, see storm-crawler#541. Verified that content signatures are now stored in ES.