buren / wayback_archiver

Ruby gem to send URLs to Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License

Road to v2 #68

Open buren opened 2 years ago

buren commented 2 years ago

Happy for any input!

buren commented 2 years ago

🔔 @danshearer, @bartman081523, @xplosionmind, @shoeper, @fgrehm, @jhcloos, @milliken, @jeanpauldejong

If any of you have any input or ideas I would love to hear them! ⭐

danshearer commented 2 years ago

The good thing about retries is that they put off to some extent the need to think about tracking state.

The state issue is this: if there are 2000 URLs to submit, and wayback_archiver has submitted 300 of them before aborting with an exception, it seems a bit silly to restart at URL 1. A first implementation of state tracking might only cover sites that have a Sitemap, because then it is trivial.
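
For a Sitemap walk, a minimal sketch of that kind of resume support could be as simple as a file of already-submitted URLs (the file name, and the `sitemap_urls`/`submit` helpers, are hypothetical placeholders):

```ruby
require 'set'

# Hypothetical resume file: every successfully submitted URL is appended here,
# so an aborted run can pick up where it left off instead of restarting at URL 1.
STATE_FILE = '.wayback_archiver_state'.freeze

already_done = File.exist?(STATE_FILE) ? Set.new(File.readlines(STATE_FILE, chomp: true)) : Set.new

sitemap_urls.each do |url|          # `sitemap_urls` is a placeholder for the parsed Sitemap
  next if already_done.include?(url)

  submit(url)                       # placeholder for the actual Wayback submission call
  File.open(STATE_FILE, 'a') { |f| f.puts(url) }
end
```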

danshearer commented 2 years ago

Here is a possible feature to consider: update-only pushing. This might be a better alternative to keeping state, even in the case of a Sitemap walk, as described below. At the moment wayback_archiver blindly pushes URLs.

I didn't find any reference to the Wayback Availability JSON API being rate limited. Presumably it is at some level, but it seems unlikely to be as strict as the submission API. That means it is a cheap operation to query whether the URL we are about to push already exists, using Wayback's notion of the "closest" snapshot. If, for example, there appears to be an identical URL with a snapshot from 5 minutes ago, then we might decide to skip it and move on to the next one.

Using the Availability API means wayback_archiver can still be stateless and yet not keep repeating existing work. And for smaller-scale sites (say a few thousand URLs) we don't need any kind of sophisticated tree walk algorithm because the API is cheap.
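
A rough sketch of that skip check, using the public Availability endpoint at https://archive.org/wayback/available (the `max_age` threshold and the method name are assumptions):

```ruby
require 'net/http'
require 'json'
require 'time'

# Ask the Availability API for the snapshot closest to "now" and skip the URL
# if it was captured recently. `max_age` (in seconds) is an assumed threshold.
def recently_archived?(url, max_age: 24 * 60 * 60)
  uri = URI('https://archive.org/wayback/available')
  uri.query = URI.encode_www_form(url: url, timestamp: Time.now.utc.strftime('%Y%m%d%H%M%S'))

  closest = JSON.parse(Net::HTTP.get(uri)).dig('archived_snapshots', 'closest')
  return false unless closest && closest['available']

  snapshot_time = Time.strptime("#{closest['timestamp']}+0000", '%Y%m%d%H%M%S%z')
  Time.now - snapshot_time < max_age
end

# submit(url) unless recently_archived?(url)
```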

danshearer commented 2 years ago

Easy features to implement would be --order-reverse and --order-random. This is the very simplest take on "don't submit URL 1 again and again": start from the bottom of the Sitemap, or walk through it in random order. It still doesn't keep any state, but it gives a modest improvement with almost no development effort.
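
A sketch of how those flags might be applied to the sitemap URL list before submission (the option handling shown here is hypothetical):

```ruby
# Reorder the sitemap URLs before submission; still completely stateless.
def ordered_urls(urls, order = :default)
  case order
  when :reverse then urls.reverse
  when :random  then urls.shuffle
  else urls
  end
end
```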

bartman081523 commented 2 years ago

@buren Thank you for inviting me, I will happily and naively suggest:

  1. Loading and saving state in the crawling process: a map with all the crawl targets would have to be created and saved first, then updated with the crawl results. This also makes it easier to track errors and to re-crawl failed or not-yet-archived targets. (Maybe gzip the state-tracking sitemap in a temp dir.)
  2. Maybe split the crawling and uploading functionality, and only chain them together in auto mode. This also makes archiving easier: you can crawl many addresses with a high thread count and then archive them one after another with a low thread count (as archive.org has required lately). (Just a naive suggestion.)
  3. Adding at least one more archive service would give better resilience against errors or breaking changes on archive.org's side. Maybe allow custom archive targets (--custom-target="https://archive.ph/submit/?&url=%%url%%"); see the sketch after this list. (Not easy in terms of result tracking, but I think most archive pages forward to the archived result.) (Just a naive suggestion.)
  4. Specify filetypes to crawl (--filetype= all | txt | txt,pdf | all,-pdf | [etc])
  5. JSON input and output. I recently read about how much portability it gives when programs can be chained together via JSON input and output; it might not be far off from CSV output. (Just a naive suggestion.)
  6. Don't overcomplicate things in auto mode; keep auto mode easy to access, but you most likely do that already :-D
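
For point 3, a minimal sketch of what a custom target could look like, assuming the %%url%% placeholder syntax from the example above and that the target accepts a plain GET (the method name is hypothetical):

```ruby
require 'net/http'
require 'uri'

# Substitute the URL to be archived into a user-supplied target template
# and fire the request. Assumes the custom target accepts a plain GET.
def submit_to_custom_target(template, url)
  target = template.gsub('%%url%%', URI.encode_www_form_component(url))
  Net::HTTP.get_response(URI(target))
end

# submit_to_custom_target('https://archive.ph/submit/?&url=%%url%%', 'https://example.com/page')
```
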
MatzFan commented 2 years ago

Interesting project, was going to implement this myself until I found you. Some info which may or may not be helpful:

The (draft) Save Page Now 2 (SPN2) API docs are here. AFAICT this is the API the Wayback Machine uses for saving URLs as an authenticated user. The spec allows cookie or API key authentication (I can't get the former to work). An authenticated page save returns a JSON response, so:

curl web.archive.org/save -d "url=github.com" -H "Accept: application/json" -H "Authorization: LOW myaccesskey:mysecret"

gives a 200 response and:

{"url":"github.com","job_id":"spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b"}

You can then make a status request for that job_id like this:

curl web.archive.org/save/status/spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b -H "Accept: application/json" -H "Authorization: LOW myaccesskey:mysecret"

The JSON response includes lots of information, including a status key whose value may be "error", "pending" or "success". This could be used to retry failed jobs.
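
In Ruby, a submit-then-poll flow against those two endpoints could look roughly like this (a sketch only: the credentials are the placeholders from the curl examples, and the polling interval is an assumption):

```ruby
require 'net/http'
require 'json'

SPN2_HEADERS = {
  'Accept' => 'application/json',
  'Authorization' => 'LOW myaccesskey:mysecret' # placeholder credentials, as in the curl examples
}.freeze

# Submit a URL for archiving and return the job_id from the JSON response.
def spn2_submit(url)
  res = Net::HTTP.post(URI('https://web.archive.org/save'),
                       URI.encode_www_form(url: url), SPN2_HEADERS)
  JSON.parse(res.body).fetch('job_id')
end

# Poll the status endpoint until the job is no longer "pending".
def spn2_status(job_id)
  uri = URI("https://web.archive.org/save/status/#{job_id}")
  loop do
    res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      http.request(Net::HTTP::Get.new(uri, SPN2_HEADERS))
    end
    status = JSON.parse(res.body)['status']
    return status unless status == 'pending'

    sleep 5 # polling interval is a guess, not from the docs
  end
end

# job_id = spn2_submit('github.com')
# spn2_status(job_id) # => "success" or "error"
```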

One other thing I've noticed: the job_id is simply "spn2-" followed by a SHA1 hash of the URL*.

*In the form http://<url>/. So any of the following parameter values in this example will yield the job_id above: github.com, http://github.com, https://github.com, github.com/, etc. Proof:

$ echo "http://github.com/"|tr -d "\n"|shasum
=> 8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b  -
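
The same derivation in Ruby, assuming the normalisation described above (strip the scheme and any trailing slash, then wrap in http://…/); the fragment exception mentioned below is not handled:

```ruby
require 'digest'

# Reproduce a job_id from a URL: "spn2-" + SHA1 of "http://<url>/".
def job_id_for(url)
  host_and_path = url.sub(%r{\Ahttps?://}, '').chomp('/')
  "spn2-#{Digest::SHA1.hexdigest("http://#{host_and_path}/")}"
end

# job_id_for('github.com')          # => "spn2-8674ce5a6bb3aa7e67c394bdc97a9fa1f6802f6b"
# job_id_for('https://github.com/') # => same value, per the shasum example above
```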

I've found at least one exception: when, for example, you save a page with a fragment (https://url/foo#bar), the SHA1 hash is calculated on https://url/foo, i.e. with the https scheme retained and no trailing /.

Also, beware of frequently saved URLs like example.com, as you'll just get the status of the most recent save by anyone.

The API includes rate limit parameters etc.

If you don't wish to include authenticated SPN2 REST API calls in your project, I may create a gem just for that purpose, as I am considering building a server-based archiving solution for long-running jobs.

Other thoughts: adding archive.today would be great; I'm not sure they have an official API.

MatzFan commented 2 years ago

> I may create a gem just for that purpose

So FWIW gem spn2 is now a thing. Bare bones but I'll add the rest of the SPN2 API functionality ASAP.