Need a function that downloads raw captured HTML from Internet Archive

danielballan commented 7 years ago

It should:

Check that the URL is archived by the Internet Archive.
Retrieve a list of the URIs and capture timestamps of all versions captured by the IA.
Formulate an 'Import' request for web-monitoring-db and POST it to the app.

It does not need to harvest any HTML from IA (as stated in an earlier version of this GH issue). We can just store the IA URI in our database; we don't need to maintain our own copy of the HTML.

danielballan commented 7 years ago

There is an overview of Wayback Machine APIs with:

a limited JSON API that gets the URI and timestamp of the latest version of a URL: https://archive.org/wayback/available?url=nasa.gov
a link to a blog post about IA's Memento API which has a weird response format but lets us query for version URIs given a page URL. I implemented a Python API in #38.
a git repo with a README documenting the Wayback Machine CDX Server API which gives different information, like response status codes, but no version URI. Example: https://web.archive.org/cdx/search/cdx?url=nasa.gov&output=json&limit=-3
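To make the CDX option above concrete, here is a minimal sketch of building the query URL and turning the JSON output into one dict per capture. It assumes the `output=json` response is a list of rows whose first row is the header of field names; the helper names `cdx_query_url` and `parse_cdx_json` are hypothetical, and nothing here performs a network request.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url, limit=None):
    """Build a CDX query URL asking for JSON output (hypothetical helper)."""
    params = {"url": page_url, "output": "json"}
    if limit is not None:
        # A negative limit asks for the last N captures, as in limit=-3.
        params["limit"] = limit
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(text):
    """Turn CDX JSON output into a list of dicts keyed by field name.

    Assumes the first row is the header row; an empty response
    yields an empty list.
    """
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]
```

The response body would come from something like `requests.get(cdx_query_url("nasa.gov", limit=-3)).text`; each resulting dict should then expose fields such as `timestamp`, `original`, `statuscode` and `digest`.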

Other potentially useful links:

https://blog.archive.org/developers/
https://archive.readme.io/docs/memento
https://internetarchive.readthedocs.io/ (does not seem to address the Wayback Machine in particular)
a Python API to the limited JSON API (doesn't add much value to requests IMO)

One problematic wrinkle: IA inserts an HTML toolbar and special JS into the archived pages it serves. They attempt to delineate the inserted content with HTML comments -- search for wayback in the source code of this page, for example -- but it's not perfectly done.

klauer commented 7 years ago

Looks like it's pretty easy to get the raw pages from archive.org. (reference)

Compare the following (note the id_ at the end of the timestamp):
With toolbar: https://web.archive.org/web/20060518204947/http://www.google.com/
Without toolbar: https://web.archive.org/web/20060518204947id_/http://www.google.com/
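The URL pattern in the comparison above can be sketched as a small helper. The function name `wayback_url` and its `raw` parameter are hypothetical; the only thing taken from the thread is that appending `id_` to the timestamp yields the capture without the toolbar.

```python
def wayback_url(timestamp, original_url, raw=True):
    """Build a Wayback Machine URL for one capture (hypothetical helper).

    With raw=True the timestamp gets the id_ suffix, which asks for the
    page as captured, without the injected toolbar and JS.
    """
    flag = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flag}/{original_url}"
```

For example, `wayback_url("20060518204947", "http://www.google.com/")` gives the "without toolbar" URL quoted above, and passing `raw=False` gives the "with toolbar" one.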

danielballan commented 7 years ago

Ha! Worlds collide. Thanks, @klauer, that's exactly what I was looking for.

titaniumbones commented 7 years ago

Tagging @bnewbold after a lovely conversation this morning at WADL. Nice to meet you Bryan! Looks like @klauer may have solved this issue after all, but just in case there are some remaining questions, I think this is the main place where we're talking about IA integration. Thank you and hope to see you around here!

danielballan commented 7 years ago

I would still love to get an overview of the APIs and a sense of whether any of them are "preferred" or particularly likely to be well-maintained in the future. @klauer answered my last question in the comment above, but I'm still feeling disoriented by the many disparate APIs and scattered documentation.

danielballan commented 7 years ago

More useful links:

bnewbold commented 7 years ago

Hi @danielballan! Sorry for the slow reply. If you only need to work with IA, I would recommend the CDX API. Several internal and external systems depend on it, it's relatively optimized and cheap for us to serve, and is unlikely to go anywhere in coming years. If you want to be compatible with other web archives, building on the Memento API (which implements a public standard) might make more sense. Apologies that our documentation is all over the place... it reflects the age and evolution of our services and needs consolidation.

bnewbold commented 7 years ago

The CDX API docs at https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server are up to date. Note that you can request a JSON-formatted return value (if that saves you some parsing). The API details for handling very large response lists (hundreds of thousands of lines) are a little subtle; let me know if you need help taking advantage of that.

It might help as context to understand that the CDX index is essentially an extremely large SURT-sorted text file: enumerating all crawls with a given prefix is very fast, but we have no secondary indexes on the other fields (e.g., lookup by checksum or something like that).

I strongly recommend using the checksum field to verify any HTML you do fetch, and storing that field on your end for future verification. It's SHA-1 in base32 format (not the more popular hex encoding, but trivial to convert).

From your earlier comment, I'm not sure what you meant by missing a "version URI"; do you mean a link to the content in Wayback? This can safely be constructed using the timestamp and converting the SURT to a URL, e.g. using a library like https://github.com/internetarchive/surt
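As a rough illustration of the checksum advice above, the following sketch converts a base32 SHA-1 digest to hex and checks fetched bytes against it. The function names are hypothetical. It also assumes the digest covers the raw response body as captured (the `id_` form of the page, not the toolbar-wrapped one), which is worth confirming against the CDX docs before relying on it.

```python
import base64
import hashlib


def sha1_base32(payload: bytes) -> str:
    """Compute a SHA-1 digest in base32, the encoding the CDX index uses."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def base32_to_hex(digest: str) -> str:
    """Convert a base32 SHA-1 digest to the more familiar hex form."""
    return base64.b32decode(digest.upper()).hex()


def matches_cdx_digest(payload: bytes, cdx_digest: str) -> bool:
    """Check fetched bytes against a stored digest (hypothetical helper).

    Assumption: payload is the raw body exactly as archived.
    """
    return sha1_base32(payload) == cdx_digest.upper()
```

A SHA-1 digest is 20 bytes, so the base32 form is always 32 characters with no padding, which makes it easy to store in a fixed-width column alongside the capture timestamp.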

edgi-govdata-archiving / web-monitoring-processing

Need a function that downloads raw captured HTML from Internet Archive #3