Closed ghost closed 5 years ago
I think wpull supports resuming a crawl (with potentially long startup times), but grab-site never wrapped that functionality. I've sometimes poorly resumed a crawl by dumping the todo URLs with gs-dump-urls and doing a grab-site --1 -i LIST, but otherwise I just start the crawls from scratch. I wish this worked better, sorry.
No worries. Thanks for the gs-dump-urls tip. I wasn't even aware of that command. I'll try that :)
(Starting from scratch on a 12M+ line list isn't really an option I'd choose if I can avoid it :) )
gs-dump-urls seems to have solved my problem in an easy manner.
For future reference, for anyone else coming across the same problem:
# Example from my situation, where the crawl was cut off by a computer crash/reboot
########################
# Get any links that were being processed at the time of the crash
docker exec warcfactory gs-dump-urls data-anotids-yt_anot_urls_nodupcheck.txt-2018-12-02-a354fb31/wpull.db in_progress > links_to_continue.txt
# Append any links that were still waiting to be processed to the list
docker exec warcfactory gs-dump-urls data-anotids-yt_anot_urls_nodupcheck.txt-2018-12-02-a354fb31/wpull.db todo >> links_to_continue.txt
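The two statuses shouldn't normally overlap in wpull.db, but a sort -u pass is a cheap safety check to drop any duplicates before feeding the list back in with grab-site --1 -i (per the suggestion above). A sketch with inline sample data, where sample_list.txt stands in for the real links_to_continue.txt:

```shell
# Inline sample data standing in for a real links_to_continue.txt:
printf 'http://example.com/1\nhttp://example.com/2\nhttp://example.com/1\n' > sample_list.txt
# Deduplicate (note this also reorders the list):
sort -u sample_list.txt > sample_deduped.txt
grep -c . sample_deduped.txt   # -> 2
# Then restart with something like: grab-site --1 -i sample_deduped.txt
```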
I had a lengthy grab going that unintentionally got cut off by either a crash or a reboot. I still have the data as it was at the time.
Is there any recommended way to resume a specific crawl without having to go through all the URLs already processed? (I used an input list of URLs.)
(E.g. somehow adding already-processed URLs to a dupe or ignore list in a new grab job, or somehow subtracting them from the input list of a new grab job?)
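One way to realize the "subtracting them from the input list" idea: dump the already-finished URLs from wpull.db and remove them from the original input list with comm. The status name "done" and the file names here are assumptions (check gs-dump-urls' help for the actual status names); the subtraction step is sketched with inline sample data so it is self-contained:

```shell
# In a real run, done_urls.txt would come from something like:
#   gs-dump-urls CRAWL_DIR/wpull.db done > done_urls.txt
# (the status name 'done' is an assumption, not confirmed above)
# Inline sample data so the subtraction step runs on its own:
printf 'http://example.com/1\nhttp://example.com/2\nhttp://example.com/3\n' > input_list.txt
printf 'http://example.com/1\nhttp://example.com/3\n' > done_urls.txt
# comm -23 keeps lines that appear only in the first (sorted) file:
sort input_list.txt > input_sorted.txt
sort done_urls.txt  > done_sorted.txt
comm -23 input_sorted.txt done_sorted.txt > remaining.txt
cat remaining.txt   # -> http://example.com/2
# Then start a fresh job with something like: grab-site --1 -i remaining.txt
```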