bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

Improvement suggestions for WaybackArchiver #59

Closed: vbanos closed this issue 10 months ago

vbanos commented 1 year ago

Hi! I'm a member of Team Wayback at the Internet Archive. I have some improvement suggestions for https://github.com/bellingcat/auto-archiver/blob/0bdd06f6415e3ed4ec0582c991352b29d38cb891/archivers/wayback_archiver.py#L11

  1. You could use the Wayback Machine Availability API (https://archive.org/help/wayback_api.php) to get capture info about a URL. Requesting https://web.archive.org/web/<URL> is not recommended because its purpose is to play back the latest capture. You don't need to load the full content of the latest capture; you just need to know whether it's available.
  2. The Save Page Now API has a lot of useful options: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit

if_not_archived_within=<timedelta> should be useful in your case.

Capture web page only if the latest existing capture at the Archive is older than the <timedelta> limit. Its format could be any datetime expression like “3d 5h 20m” or just a number of seconds, e.g. “120”. If there is a capture within the defined timedelta, SPN2 returns that as a recent capture. The default system <timedelta> is 30 min.
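A minimal sketch of both suggestions, using only the Python standard library. The function names are hypothetical and not part of auto-archiver. The Availability API response shape (`archived_snapshots.closest`) follows the help page linked above. The SPN2 endpoint (`https://web.archive.org/save`) and the `Authorization: LOW <accesskey>:<secret>` header are assumptions based on the linked Save Page Now document and should be checked against it before use.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
# Assumption: SPN2 capture endpoint as described in the linked SPN2 document.
SPN2_ENDPOINT = "https://web.archive.org/save"


def availability_url(url: str) -> str:
    """Build the Availability API query URL for `url`."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': url})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest capture's URL from an Availability API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup_capture(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask whether a capture exists without loading the capture itself."""
    with urlopen(availability_url(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


def build_spn2_request(url: str, access_key: str, secret_key: str,
                       if_not_archived_within: str = "3d") -> Request:
    """Build (but do not send) an SPN2 capture request.

    Assumption: the header format and form fields follow the linked SPN2 document.
    """
    headers = {
        "Accept": "application/json",
        "Authorization": f"LOW {access_key}:{secret_key}",
    }
    data = urlencode({
        "url": url,
        "if_not_archived_within": if_not_archived_within,
    }).encode("ascii")
    return Request(SPN2_ENDPOINT, data=data, headers=headers, method="POST")
```

Calling `lookup_capture("example.com")` performs the lightweight availability check, and passing the result of `build_spn2_request(...)` to `urlopen` would submit the capture. Keeping request construction separate from sending makes the parameters easy to inspect and test without touching the network.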

Cheers!

msramalho commented 1 year ago

Thanks for the tip @vbanos, this is a good improvement to implement.