[x] I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
[x] I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.
Versions (please complete the following information):
OS: [e.g. MacOS 10.15.7]
Python Version [e.g. 3.8.8]
news-please Version [e.g. 1.5.21]
Intent (optional; we'll use this info to prioritize upcoming tasks to work on)
Thanks for the report. I fixed the issue in the log statement :-) The process may actually take quite some time. If I remember correctly, it took us roughly a month to process all files.
Mandatory
Describe the bug
The total remaining time should be
remaining_warcs * h_per_warc
instead of
remaining_warcs / h_per_warc
here: https://github.com/fhamborg/news-please/blob/3c2562470601f060828e9d0ad05ebdbb5907641f/newsplease/crawler/commoncrawl_crawler.py#L243
Moreover, how much time should the process take to complete? I've been running an extractor on a 64-core machine for a week and it has not finished yet. Do you have any experience with this?
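To illustrate the estimate described in the bug report, here is a minimal sketch. The function names `hours_per_warc` and `estimate_remaining_hours` are hypothetical and not taken from news-please; only the variable names `remaining_warcs` and `h_per_warc` come from the report above.

```python
def hours_per_warc(elapsed_hours: float, processed_warcs: int) -> float:
    """Average hours spent per WARC file so far (hypothetical helper)."""
    if processed_warcs <= 0:
        raise ValueError("at least one WARC file must have been processed")
    return elapsed_hours / processed_warcs


def estimate_remaining_hours(remaining_warcs: int, h_per_warc: float) -> float:
    """Remaining time = files still to process * hours per file."""
    return remaining_warcs * h_per_warc


# Example: 10 WARC files took 5 hours, 100 files remain.
h_per_warc = hours_per_warc(elapsed_hours=5.0, processed_warcs=10)
remaining = estimate_remaining_hours(remaining_warcs=100, h_per_warc=h_per_warc)
print(f"{remaining:.1f} hours remaining")
```

With these numbers the multiplication gives 50 hours, whereas dividing `remaining_warcs` by `h_per_warc` would report 200, a value that has no meaning as a duration.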