ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Any way to resume an (input list) crawl? #142

Closed; ghost closed this issue 5 years ago

ghost commented 5 years ago

I had a lengthy grab going that unintentionally got cut off by either a crash or a reboot. I still have the data as it was at the time.

Is there any recommended way to resume a specific crawl without having to go through all the URLs that were already processed? (I used an input list of URLs.)

(e.g. somehow adding the already-processed URLs to a dupes or ignore list for a new grab job, or somehow subtracting them from the input list of a new grab job?)

ivan commented 5 years ago

I think wpull supports resuming a crawl (with potentially long startup times), but grab-site never wrapped that functionality. I've sometimes poorly resumed a crawl by dumping the todo URLs with gs-dump-urls and doing a grab-site --1 -i LIST, but otherwise I just start the crawls from scratch. I wish this worked better, sorry.
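(For reference, a minimal sketch of that workaround; DIR-OF-OLD-CRAWL is a placeholder for the crawl directory grab-site created, and the command and flags are the ones mentioned above.)

# Dump the URLs the interrupted crawl had queued but not yet fetched
gs-dump-urls DIR-OF-OLD-CRAWL/wpull.db todo > remaining_urls.txt

# Start a fresh non-recursive job (--1) over just those URLs (-i)
grab-site --1 -i remaining_urls.txt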

ghost commented 5 years ago

No worries. Thanks for the gs-dump-urls tip. I wasn't even aware of that command. I'll try that :)

(Starting from scratch on a 12M+ line list isn't really an option I'd choose if I can avoid it :) )

ghost commented 5 years ago

gs-dump-urls seems to have solved my problem in an easy manner.

For future reference, for anyone else coming across the same problem:

# Example from my situation, where the crawl was cut off by a computer crash/reboot
########################
# Get any links that were being processed at the time of the crash
docker exec warcfactory gs-dump-urls data-anotids-yt_anot_urls_nodupcheck.txt-2018-12-02-a354fb31/wpull.db in_progress > links_to_continue.txt

# Append any links that were still waiting to be processed to the list
docker exec warcfactory gs-dump-urls data-anotids-yt_anot_urls_nodupcheck.txt-2018-12-02-a354fb31/wpull.db todo >> links_to_continue.txt
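(Not part of the original thread: a hedged sketch of the final resume step, assuming links_to_continue.txt is readable inside the warcfactory container, e.g. via a bind-mounted working directory, and reusing the flags ivan mentioned above.)

# Start a new grab-site job over only the unfinished URLs: --1 = no recursion, -i = input list
docker exec warcfactory grab-site --1 -i links_to_continue.txt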