fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Required time by commoncrawl extractor and bug in logging #219

Closed lucadiliello closed 2 years ago

lucadiliello commented 2 years ago

Mandatory

Describe the bug

The total remaining time should be remaining_warcs * h_per_warc, not remaining_warcs / h_per_warc, here: https://github.com/fhamborg/news-please/blob/3c2562470601f060828e9d0ad05ebdbb5907641f/newsplease/crawler/commoncrawl_crawler.py#L243
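To make the arithmetic concrete, here is a minimal sketch of the estimate I would expect. This is not the code from commoncrawl_crawler.py: the function name and its parameters (total_warcs, processed_warcs, elapsed_hours) are made up for illustration, and only remaining_warcs and h_per_warc correspond to the names mentioned above.

```python
from typing import Optional


def estimate_remaining_hours(
    total_warcs: int, processed_warcs: int, elapsed_hours: float
) -> Optional[float]:
    """Estimate the hours left; hypothetical helper, not part of news-please.

    Returns None until at least one WARC file has been processed,
    because no per-file rate can be computed before that.
    """
    if processed_warcs <= 0:
        return None
    # Average hours spent per WARC file so far.
    h_per_warc = elapsed_hours / processed_warcs
    remaining_warcs = total_warcs - processed_warcs
    # Hours per file times files left gives hours left.
    # Dividing instead would yield a value in files^2 per hour, not hours.
    return remaining_warcs * h_per_warc


if __name__ == "__main__":
    # 20 of 100 files done in 10 hours: 0.5 h per file, 80 files left.
    print(estimate_remaining_hours(100, 20, 10.0))  # prints 40.0
```

With division, the same numbers would give 80 / 0.5 = 160, which is why the logged estimate looks wrong.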

Moreover, how long should the process take to complete? I've been running an extractor on a 64-core machine for a week and it has not finished yet. Do you have any experience with this?


fhamborg commented 2 years ago

Thanks for the report. I fixed the issue in the log statement :-) The process may actually take quite some time. If I remember correctly, it took us roughly a month to process all files.