ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Continuing or updating a grab #153

Closed nihelmasell closed 4 years ago

nihelmasell commented 5 years ago

Hi, is there a way to continue or update a WARC capture? I've been trying your program, but with big sites my connection is sometimes lost, so I have to start over again and again. (It would also be great to be able to just update captures, since subsequent captures of big sites, at 50-100 GB each, take a lot of space.) Best regards

ivan commented 5 years ago

There's no way to continue or update a grab, but requests that fail to connect are retried later (note that this can take a while, because a failed request goes to the end of the queue). The default is 3 tries, but you can increase it. Also, if you notice your connection is down, you can pause all grab-site processes with killall -STOP grab-site and resume them later with killall -CONT grab-site.
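For anyone who wants to script the pause/resume step, here is a minimal Python sketch of the same idea. It is not part of grab-site: the helper names (find_pids, signal_all, pause, resume) are made up for illustration, and it assumes a Unix-like system where pgrep is installed and where the crawler's process name is grab-site, as the killall commands above imply.

```python
import os
import signal
import subprocess


def find_pids(name):
    """Return PIDs whose process name matches exactly (assumes pgrep is installed)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def signal_all(pids, sig):
    """Send sig to each PID and return the PIDs that were still running."""
    delivered = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            delivered.append(pid)
        except ProcessLookupError:
            # The process exited between lookup and signalling; skip it.
            pass
    return delivered


def pause(pids):
    """Suspend processes, like killall -STOP."""
    return signal_all(pids, signal.SIGSTOP)


def resume(pids):
    """Continue suspended processes, like killall -CONT."""
    return signal_all(pids, signal.SIGCONT)


# Hypothetical usage, assuming the process name is "grab-site":
#   paused = pause(find_pids("grab-site"))
#   ... wait for the connection to come back ...
#   resume(paused)
```

Keeping the list returned by pause and passing it to resume means only the processes that were actually suspended get continued.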

ivan commented 4 years ago

Leaving issue #58 for resuming a grab

baznikin commented 2 years ago

Hi, Ivan!

Great software, I just spotted it yesterday. Sorry for the "necro-issing", but as far as I can tell wpull supports resuming internally: https://wpull.readthedocs.io/en/master/usage.html#stopping-resuming Any chance of implementing it?

However, I am more interested in archive updating. Could you suggest how to do it?

TheTechRobo commented 2 years ago

#58 for resuming

baznikin commented 2 years ago

> #58 for resuming

Thanks, I just found it myself and wanted to remove my comment. Maybe it's worth putting some info about resuming in the FAQ.

But what about updating a previous archive? Is that possible?