fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Avoiding restart of commoncrawl scraping process #228

Closed joemkwon closed 2 years ago

joemkwon commented 2 years ago

Mandatory

Describe your question

I'm trying to download ccnews articles that pass certain filters (I added my own, e.g. one that predicts how likely the text is to be English). Because there are so many articles, my job is unlikely to finish before it gets interrupted. When I restart the process, I can't tell whether the articles I downloaded previously are downloaded again or whether it resumes where it left off before it was terminated. If it's the former, is there a workaround to keep articles from being redownloaded every time the process starts up again?
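To make the kind of workaround I'm asking about concrete, here is a minimal sketch of a checkpoint kept outside the crawler. It is not news-please's API, and it says nothing about what news-please does internally on restart. The names `DownloadCheckpoint`, `process_all`, and `process_one` are hypothetical, and it assumes each unit of work (for example one WARC file URL) has a stable string ID that can be recorded once that unit is completely processed.

```python
import json
import os
import tempfile


class DownloadCheckpoint:
    """Hypothetical helper: remembers which work items are fully processed."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        # Reload the IDs recorded by a previous (possibly interrupted) run.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.done = set(json.load(f))

    def is_done(self, item_id):
        return item_id in self.done

    def mark_done(self, item_id):
        self.done.add(item_id)
        # Write to a temporary file and swap it in, so an interruption
        # during the write cannot leave a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(sorted(self.done), f)
        os.replace(tmp_path, self.path)


def process_all(item_ids, process_one, checkpoint):
    """Run process_one on every item not yet in the checkpoint.

    Returns the list of items processed during this call.
    """
    processed = []
    for item_id in item_ids:
        if checkpoint.is_done(item_id):
            continue  # finished in an earlier run, skip it
        process_one(item_id)
        # Recorded only after process_one returns, so an item that was
        # interrupted halfway is retried on the next start.
        checkpoint.mark_done(item_id)
        processed.append(item_id)
    return processed
```

With something like this, a restarted job would load the same checkpoint file and only handle the items that are not recorded yet. The granularity is per item, so work inside an item that was interrupted halfway would still be repeated.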