Closed Tiptop4792 closed 2 weeks ago
hey :) thank you for your issue. the skipset does filter by url_archive (the concatenated url for the snapshot).
one approach for you could be to remove the snapshots from the .cdx file. for this you have to keep that file with --cdxbackup or --auto.
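To illustrate what a "concatenated url for the snapshot" can look like: a minimal sketch, assuming the common Wayback Machine pattern of base address, capture timestamp, and original URL. The helper name is hypothetical, and the downloader may build its url_archive value differently (for example with a modifier appended to the timestamp), so compare against a real entry in your csv before relying on it.

```python
def build_snapshot_url(timestamp: str, original: str) -> str:
    """Hypothetical helper: join a capture timestamp and the original URL
    into a Wayback Machine snapshot URL (assumed pattern, not the tool's code)."""
    return f"https://web.archive.org/web/{timestamp}/{original}"


if __name__ == "__main__":
    print(build_snapshot_url("20200101000000", "http://example.com/a.png"))
```

If the skipset compares against this kind of archive URL, entries written as plain original URLs would not match it.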
Awesome! Thanks!
Just to get this right:
I'd download the cdxbackup file, remove the snapshots I don't want, and then reinsert the cdx file via --cdxinject <filepath>. Right?
yes, that's right. the cdx file contains the pure json response from the server, so only the snapshots it contains will be downloaded.
if you use the --auto command, the downloader will handle the cdxbackup and its injection + skipping on its own, making it easier if you just want to make the process "failsafe" in case a crash occurs.
in the long term, maybe it would be an idea to add some kind of filter... is there a specific type or path you want to be removed?
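The manual edit discussed above (dropping unwanted snapshots from the cdx backup before injecting it again) could be scripted roughly like this. It is only a sketch under assumptions: it assumes the backup file is a single JSON array whose first row is a header containing an "original" column, which is how the Wayback CDX server answers with output=json. The function name and the substring matching are hypothetical, so check the layout of your own file first.

```python
import json
from pathlib import Path


def filter_cdx(src, dst, unwanted):
    """Copy a CDX JSON file from src to dst, dropping snapshots whose
    original URL contains any of the given substrings.

    Assumes the file is a JSON array of rows with a header row first.
    Returns the number of removed snapshots.
    """
    rows = json.loads(Path(src).read_text(encoding="utf-8"))
    if not rows:
        # Nothing to filter; write the empty list back unchanged.
        Path(dst).write_text(json.dumps(rows), encoding="utf-8")
        return 0

    header, snapshots = rows[0], rows[1:]
    col = header.index("original")  # column holding the original URL

    kept = [
        row for row in snapshots
        if not any(part in row[col] for part in unwanted)
    ]
    Path(dst).write_text(json.dumps([header] + kept), encoding="utf-8")
    return len(snapshots) - len(kept)
```

For example, `filter_cdx("backup.cdx", "filtered.cdx", [".png", "/ads/"])` would write a copy without those snapshots, and the filtered file could then be passed to --cdxinject <filepath>. The file names here are placeholders.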
I came across two occasions where this can be super useful:
This issue is marked as stale because there was no activity for 30 days.
This issue has been closed because there has been no activity for 14 days while it was marked as stale.
The --skip parameter works great for interrupted downloads.
However, the other day I wanted to download only specific files and exclude others. I couldn't figure out how to set up a csv file on my own.
Also, it didn't work when I tried to amend waybackup_<sanitized_url>.csv, created by the downloader. I tried to add the links I didn't want to download to the row url_origin, but it didn't skip the links added. Any advice? Thanks!!