ArchiveTeam / urls-grab

Archiving URLs (outlinks) from a variety of sources.
The Unlicense

Deduplicate files over certain size with the Wayback Machine #5

Open Arkiver2 opened 3 years ago

Arkiver2 commented 3 years ago

Files over a certain size, say x bytes, should be looked up on IA. If the URL already exists on IA (optionally, only if it is in an Archive Team collection), it will not be written to the WARC.

Arkiver2 commented 2 years ago

The functionality for this is implemented, but it is not in use at the moment because of resource problems at IA. The minimum size that triggers a lookup may need to be high.
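
For illustration only, here is a minimal Python sketch of the size-gated lookup described in this issue. It is not the code used in urls-grab. It assumes the public Wayback Machine availability endpoint (`https://archive.org/wayback/available`) as the lookup source. The names `MIN_DEDUP_SIZE`, `has_snapshot`, `lookup_wayback`, and `should_write_to_warc` are hypothetical, the threshold value is a guess, and the optional Archive Team collection check is left out because that endpoint does not report collections.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical threshold: the issue only says the minimum "may need to be high".
MIN_DEDUP_SIZE = 100 * 1024 * 1024  # 100 MiB, an assumed value

AVAILABILITY_API = "https://archive.org/wayback/available"


def has_snapshot(availability_response: dict) -> bool:
    """Return True if an availability response reports a usable snapshot."""
    closest = availability_response.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def lookup_wayback(url: str, timeout: float = 30.0) -> bool:
    """Ask the Wayback Machine whether it already holds a capture of `url`."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}", timeout=timeout) as resp:
        return has_snapshot(json.load(resp))


def should_write_to_warc(url: str, size: int,
                         min_size: int = MIN_DEDUP_SIZE,
                         lookup=lookup_wayback) -> bool:
    """Decide whether a fetched response should be written to the WARC.

    Small files are always written. Large files are skipped only when the
    lookup confirms an existing capture; if the lookup fails, the file is
    written anyway so that nothing is lost.
    """
    if size <= min_size:
        return True
    try:
        return not lookup(url)
    except (OSError, ValueError):
        # Network errors and malformed JSON both fall back to writing.
        return True
```

Passing the lookup in as a parameter keeps the decision logic testable without network access, and a high `min_size` limits how many requests reach IA, which matters given the resource problems mentioned above.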