Closed danielballan closed 7 years ago
There is an overview of the Wayback Machine APIs, with example queries like:
https://archive.org/wayback/available?url=nasa.gov
https://web.archive.org/cdx/search/cdx?url=nasa.gov&output=json&limit=-3
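For illustration, here is a minimal sketch of building the CDX query above and reading its JSON output. The helper names (`cdx_query_url`, `parse_cdx_json`) are made up for this example, and it assumes that with `output=json` the server returns a list of rows whose first row holds the field names.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, limit=None):
    """Build a CDX search URL that asks for JSON output (hypothetical helper)."""
    params = {"url": url, "output": "json"}
    if limit is not None:
        # A negative limit asks for the last N captures, as in the example URL.
        params["limit"] = limit
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(text):
    """Turn a CDX JSON response body into a list of dicts, one per capture.

    Assumes the first row is the header (field names) and the remaining
    rows are captures.
    """
    if not text.strip():
        return []
    rows = json.loads(text)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

The response body could be fetched with something like `requests.get(cdx_query_url("nasa.gov", limit=-3)).text` and passed to `parse_cdx_json`.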
Other potentially useful links:
requests
One problematic wrinkle: IA inserts an HTML toolbar and special JS into the archived pages it serves. They attempt to delineate the inserted content with HTML comments -- search for `wayback` in the source code of this page, for example -- but it's not perfectly done.
Looks like it's pretty easy to get the raw pages from archive.org. (reference)
Compare the following (note the `id_` right after the timestamp in the URL):
With toolbar: https://web.archive.org/web/20060518204947/http://www.google.com/
Without toolbar: https://web.archive.org/web/20060518204947id_/http://www.google.com/
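The pattern in those two URLs can be sketched as a small helper. `wayback_url` is a hypothetical name, not part of any IA library; the only thing it relies on is the `id_` suffix on the timestamp shown above.

```python
def wayback_url(timestamp, original_url, raw=False):
    """Build a Wayback playback URL for a capture.

    timestamp: 14-digit capture time, e.g. "20060518204947".
    raw: if True, append "id_" to the timestamp to request the page
         without the inserted toolbar and JS.
    """
    flag = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flag}/{original_url}"
```

For example, `wayback_url("20060518204947", "http://www.google.com/", raw=True)` reproduces the "without toolbar" link above.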
Ha! Worlds collide. Thanks, @klauer, that's exactly what I was looking for.
Tagging @bnewbold after a lovely conversation this morning at WADL. Nice to meet you Bryan! Looks like @klauer may have solved this issue after all, but just in case there are some remaining questions, I think this is the main place where we're talking about IA integration. Thank you and hope to see you around here!
I would still love to get an overview of the APIs and a sense of whether any of them are "preferred" or particularly likely to be well-maintained in the future. @klauer solved my last question in this comment above but I'm still feeling disoriented by the many disparate APIs and scattered documentation.
Hi @danielballan! Sorry for the slow reply. If you only need to work with IA, I would recommend the CDX API. Several internal and external systems depend on it, it's relatively optimized and cheap for us to serve, and is unlikely to go anywhere in coming years. If you want to be compatible with other web archives, building on the Memento API (which implements a public standard) might make more sense. Apologies that our documentation is all over the place... it reflects the age and evolution of our services and needs consolidation.
The CDX API docs at https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server are up to date. Note that you can request a JSON-formatted return value (if that saves you some parsing). The API details for handling very large response lists (hundreds of thousands of lines) are a little subtle; let me know if you need help taking advantage of that. It might help as context to understand that the CDX index is essentially an extremely large SURT-sorted text file: enumerating all crawls with a given prefix is very fast, but we have no secondary indexes on the other fields (e.g., lookup by checksum or something like that). I strongly recommend using the checksum field to verify any HTML you do fetch, and storing that field on your end for future verification. It's SHA-1 in base32 format (not the more popular hex encoding, but trivial to convert). From your earlier comment, I'm not sure what you meant by missing a "version URI"; do you mean a link to the content in Wayback? This can safely be constructed using the timestamp and converting the SURT to a URL, e.g. using a library like https://github.com/internetarchive/surt
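As a rough sketch of the checksum advice, the base32 SHA-1 comparison and the base32-to-hex conversion can be done with the Python standard library. The function names are hypothetical, and this assumes the digest was computed over the raw archived payload, so the bytes being checked should come from the raw (`id_`) capture rather than the toolbar version.

```python
import base64
import hashlib


def sha1_base32(payload: bytes) -> str:
    """Return the SHA-1 of payload as a 32-character base32 string."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def base32_to_hex(digest: str) -> str:
    """Convert a base32 SHA-1 digest to the more common hex encoding."""
    return base64.b32decode(digest).hex()


def matches_digest(payload: bytes, cdx_digest: str) -> bool:
    """Check fetched bytes against the digest field of a CDX record."""
    return sha1_base32(payload) == cdx_digest.strip().upper()
```

Storing the base32 string as-is keeps it directly comparable with later CDX lookups; `base32_to_hex` is only needed when comparing against tools that report hex.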
It should:
It does not need to harvest any HTML from IA (as previously stated on an early version of this GH issue). We can just store the IA URI in our database; we don't need to maintain our own copy of it.