[FEATURE] How do I set up a skip file for excluding specific file types?

bitdruid / python-wayback-machine-downloader

Query and download archive.org as simple as possible.

MIT License


[FEATURE] How do I set up a skip file for excluding specific file types? #23

Closed Tiptop4792 closed 1 month ago

Tiptop4792 commented 3 months ago

The --skip parameter works great for interrupted downloads.

However, the other day I wanted to download only specific files and exclude others. I couldn't figure out how to set up a CSV file on my own.

Also, it didn't work when I tried to amend waybackup_<sanitized_url>.csv, the file created by the downloader. I added the links I didn't want to download to the url_origin column, but they weren't skipped.

Any advice? Thanks!!

bitdruid commented 3 months ago

hey :) thank you for your issue. the skipset filters by url_archive (the concatenated url of the snapshot), so links added only under url_origin are not matched.

one approach for you could be to remove the snapshots from the .cdx file. to do that you have to keep the file by using --cdxbackup or --auto

Tiptop4792 commented 3 months ago

Awesome! Thanks!

Just to get this right:

I'd download the cdxbackup file, remove the snapshots I don't want and then reinsert the cdx file via --cdxinject <filepath>. Right?

bitdruid commented 3 months ago

yes, that's right. the cdx file contains the pure json response from the server, so only the snapshots it contains will be downloaded.
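
A minimal sketch of that edit step in Python, not part of the downloader itself. It assumes the cdx backup is a single JSON array of rows whose first row holds the field names, and that one of those fields is called `original` (as in the Wayback CDX API's JSON output); check your own file before relying on either assumption. The function names and paths are made up for illustration.

```python
import json


def drop_snapshots(cdx_rows, unwanted_urls, url_field="original"):
    """Return the CDX rows without the snapshots listed in unwanted_urls.

    Assumption: cdx_rows is a list of lists and the first row is the header.
    """
    if not cdx_rows:
        return []
    header, *snapshots = cdx_rows
    url_index = header.index(url_field)  # raises ValueError if the field is missing
    unwanted = set(unwanted_urls)
    kept = [row for row in snapshots if row[url_index] not in unwanted]
    return [header] + kept


def filter_cdx_file(src_path, dst_path, unwanted_urls):
    """Read a cdx backup, drop the unwanted snapshots, write a new file."""
    with open(src_path, encoding="utf-8") as src:
        cdx_rows = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(drop_snapshots(cdx_rows, unwanted_urls), dst)
```

The file written to `dst_path` is then the one to hand to --cdxinject <filepath>. Writing to a new path keeps the original backup untouched in case the filter removes too much.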

if you use the --auto option, the downloader will handle the cdxbackup, its injection and the skipping on its own, which makes it easier if you just want the process to be "failsafe" in case a crash occurs.

in the long term maybe it would be an idea to add some kind of filter... is there a specific type or path you want to be removed?

Tiptop4792 commented 3 months ago

I came across two occasions where this would be super useful:

1. The other day I had an issue with file names that were too long. I wanted to remove those files from the csv manually, but didn't manage to reinject them (I didn't know about --cdxinject before). But maybe --auto would do a better job in that particular situation?
2. I do a lot of bulk downloading and then searching for stuff, so I'm basically interested in text files and HTML. I don't need pictures, videos, etc. Removing those would speed up downloading and take pressure off the Archive.
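
Both cases could be handled by filtering the cdx rows before injecting the file. The following Python sketch is only an illustration under assumptions: the rows are a list of lists with a header row first, the header contains `original` and `mimetype` fields (the names the Wayback CDX API uses; the backup written by the downloader may not include both), and 255 characters is used as a rough file name limit. The names `wanted` and `keep_wanted` are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of "text" types; extend it to whatever you want to keep.
TEXT_MIMETYPES = {"text/html", "text/plain"}
# Assumed limit; 255 is a common maximum file name length.
MAX_NAME_LENGTH = 255


def wanted(snapshot, max_name_length=MAX_NAME_LENGTH):
    """Decide whether one snapshot (a dict of CDX fields) should be kept."""
    if snapshot.get("mimetype") not in TEXT_MIMETYPES:
        return False
    # Last path segment of the original URL, used as a stand-in for the file name.
    name = posixpath.basename(urlsplit(snapshot["original"]).path)
    return len(name) <= max_name_length


def keep_wanted(cdx_rows):
    """Keep the header row plus every snapshot row that passes wanted()."""
    if not cdx_rows:
        return []
    header, *snapshots = cdx_rows
    return [header] + [row for row in snapshots if wanted(dict(zip(header, row)))]
```

The length check only looks at the last path segment of the URL, so it is an approximation: the name the downloader actually writes to disk may be built differently.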

github-actions[bot] commented 2 months ago

This issue is marked as stale because there was no activity for 30 days.

github-actions[bot] commented 1 month ago

This issue has been closed because there has been no activity for 14 days while it was marked as stale.