QualitativeDataRepository / archivr

Web Archiving Tool

Add archive.is as third archiving option #35

Open adam3smith opened 3 years ago

mccallc commented 2 years ago

OK, I've come to the conclusion that implementing this source is not feasible. Is there something obvious I'm missing? Please let me know if there is.

There is a now-abandoned Python implementation for submitting to archive.is (last updated 2020), but trying to use it now always produces an HTTP 429 error. I ran into the same problem when I tried to emulate the main form submission with rvest. If you browse to the site manually after that, you get hit with a CAPTCHA. I think they've walled the service off pretty well from basic scrapers.
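For anyone who wants to reproduce the check, here is a rough sketch of how a client could tell the rate-limit response apart from other failures. It is in Python rather than R only to keep it dependency-free. The function names, the `endpoint` parameter, and the 60-second default are all made up for illustration, and I'm assuming (not confirming) that the service sends a standard `Retry-After` header with its 429 responses.

```python
import urllib.error
import urllib.request
from typing import Mapping


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse submission outcome."""
    if status == 429:  # Too Many Requests: the rate limit described above
        return "rate_limited"
    if 200 <= status < 400:
        return "ok"
    return "error"


def retry_after_seconds(headers: Mapping[str, str], default: float = 60.0) -> float:
    """Read a Retry-After value given in whole seconds, else use the default.

    Retry-After may also be an HTTP date; this sketch does not parse that
    form and falls back to the (arbitrary) default instead.
    """
    raw = headers.get("Retry-After")
    if raw is not None and raw.strip().isdigit():
        return float(raw.strip())
    return default


def probe(endpoint: str, timeout: float = 10.0) -> str:
    """Request a caller-supplied URL and classify the response.

    `endpoint` is hypothetical: no real submission URL is assumed here.
    """
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return classify_status(err.code)
```

This only detects the 429. It does nothing about the CAPTCHA that follows, which is the real blocker.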

The Memento Robust Links API discourages use for explicit archiving, and the tool they recommend for this purpose, archivenow, has an archive.is handler that submits collections of URLs by commandeering a running instance of Firefox (?!) through a library called Selenium. The program itself isn't that complex, but dependencies that weird make it pretty hostile to implementation in R.