News archive is not available since 06.06.2021

commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

Apache License 2.0


News archive is not available since 06.06.2021 #46

Closed: zikolach closed this issue 3 years ago

zikolach commented 3 years ago

There seems to be only one file available for 2021-06-06 and nothing since then. Have there been any changes to the news dataset?

$ aws s3 ls --no-sign-request commoncrawl/crawl-data/CC-NEWS/2021/06/
2021-06-01 06:05:03 1072694208 CC-NEWS-20210601011537-00178.warc.gz
2021-06-01 08:05:03 1072700698 CC-NEWS-20210601032956-00179.warc.gz
...
2021-06-05 21:05:03 1072700332 CC-NEWS-20210605162324-00264.warc.gz
2021-06-05 22:05:03 1072724264 CC-NEWS-20210605180523-00265.warc.gz
2021-06-06 17:05:03 1072722205 CC-NEWS-20210605195038-00266.warc.gz
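
A gap like the one in this listing can also be detected automatically. The following is a minimal sketch, not part of the news-crawl project: it parses `aws s3 ls` output lines in the format shown above and reports consecutive uploads that are further apart than a threshold. The function names `parse_listing` and `find_gaps` and the 6-hour threshold are assumptions chosen for illustration; the sample lines are copied from the listing.

```python
from datetime import datetime, timedelta

# Sample lines copied from the listing above (the elided middle is left out).
LISTING = """\
2021-06-05 21:05:03 1072700332 CC-NEWS-20210605162324-00264.warc.gz
2021-06-05 22:05:03 1072724264 CC-NEWS-20210605180523-00265.warc.gz
2021-06-06 17:05:03 1072722205 CC-NEWS-20210605195038-00266.warc.gz
"""


def parse_listing(text):
    """Return sorted (upload_time, key) pairs from `aws s3 ls` output."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank lines, "..." and anything unexpected
        date, time, _size, key = parts
        uploaded = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        entries.append((uploaded, key))
    return sorted(entries)


def find_gaps(entries, max_gap=timedelta(hours=6)):
    """Return (previous_key, next_key, gap) for uploads further apart than max_gap.

    The 6-hour default is an arbitrary assumption, not a project setting.
    """
    gaps = []
    for (t_prev, k_prev), (t_next, k_next) in zip(entries, entries[1:]):
        if t_next - t_prev > max_gap:
            gaps.append((k_prev, k_next, t_next - t_prev))
    return gaps


if __name__ == "__main__":
    for prev_key, next_key, gap in find_gaps(parse_listing(LISTING)):
        print(f"{gap} between {prev_key} and {next_key}")
```

In practice the output of the `aws s3 ls` command above could be piped into such a script instead of using the hard-coded sample.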

sebastian-nagel commented 3 years ago

Thanks, @zikolach! Issue confirmed: it was caused by a partial failure of the status index. The crawler is still running but is discovering practically no new articles. I hope to get it fixed within a few hours.

sebastian-nagel commented 3 years ago

The crawler is now back to normal and the first WARC file has been uploaded. As expected, during the first two hours the crawler was mostly occupied with fetching and parsing all the feeds and news sitemaps it had missed since Saturday 20:07 UTC, when the status index failed. It's now running well and creating multiple WARC files per hour, which will be uploaded soon. Thanks again, @zikolach!

zikolach commented 3 years ago

@sebastian-nagel thanks a lot for the quick response and the fix!