Open Mr0grog opened 6 years ago
Other notes I have discovered in edgi-govdata-archiving/web-monitoring-processing#174: this new API doesn’t support `resumeKey`; you have to use `page` and `pageSize` for iterating through results (which is not as straightforward as you might think).
Update: since the above conversation happened, Wayback folks have started gently pushing us to more actively use the newer services, like timemap CDX and SPN2. So I think the answer to this issue is probably “yes we should” now.
Since this is still beta-ish, we should probably implement this alongside the old `/cdx/search` API.
I’ve been holding off on this since @danielballan is in the middle of splitting off this code into http://github.com/edgi-govdata-archiving/wayback. It should be done, but in that new repo whenever it’s ready.
Note to selves: once this is closed, it might be kind to state in the release notes how to migrate wayback v0.1 code to whatever API we settle on for timemap, if doing so is not too much trouble.
FWIW, I think the API (from a user of this package’s perspective) would be the same. The Timemap CDX API (which, to be clear, is not the timemap API, which is a whole other thing!):
- Returns data in the same format as the CDX API, but has some extra fields on the end that aren’t generally useful unless you have access to internal archive.org services (supposedly these will be removed from the public API at some point).
- Does paging differently, but we don’t expose access to the paging in our Python API anyway, so this should mostly be an implementation detail that is largely invisible to a user. (In the current CDX API, you can paginate via `resumeKey` or via actual page size & number, but the latter will not give you recent data. In the new Timemap CDX API, there is no `resumeKey` and you must use page size & number, but it should include up-to-date data.)
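For illustration, here is a rough sketch of how the two paging styles differ at the query-parameter level. The helper names are made up for this example and are not part of the package; the parameter names are the ones mentioned in this thread.

```python
from urllib.parse import urlencode


def cdx_paging_params(resume_key=None):
    """Paging parameters for the current CDX API (hypothetical helper).

    The first request asks the server to append a resume key to the results;
    later requests pass that key back to continue where the last one stopped.
    """
    params = {"showResumeKey": "true"}
    if resume_key is not None:
        params["resumeKey"] = resume_key
    return params


def timemap_cdx_paging_params(page, page_size):
    """Paging parameters for the Timemap CDX API (hypothetical helper).

    There is no resume key here; the caller has to track the page number.
    """
    return {"page": page, "pageSize": page_size}


# Example query strings (no request is actually sent here).
first_cdx_query = urlencode(cdx_paging_params())
next_cdx_query = urlencode(cdx_paging_params("some-opaque-key"))
timemap_query = urlencode(timemap_cdx_paging_params(page=2, page_size=5))
```

Either way, the loop that walks through pages would live inside the package, so callers would not need to see these parameters.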
Ah, I was conflating the Timemap CDX API with the timemap API. I have half-absorbed the fact that they are different things, but I got confused here. Which one did wayback v0.1 implement?
Wayback v0.1 implemented the Timemap API (not Timemap CDX, which isn’t really its name, but it doesn’t have one, and ¯\_(ツ)_/¯).
If helpful (since Wayback APIs are a half-documented, scattered situation):
- The CDX API, which lets you search through a CDX-based index (and returns a subset of fields from each matching CDX record), is at http://web.archive.org/cdx/search/cdx
- The “Timemap CDX” API is the same thing, but uses different code and (I think?) a separate CDX index, and is at http://web.archive.org/web/timemap/cdx (I call it “Timemap CDX” because of the URL. I have also heard “new CDX,” “beta CDX,” “CDX v2,” etc.)
- The Timemap API is part of the Memento protocol (guide, RFC, Wayback-specific “docs”), which is a semi-standard agreed to by lots of archives. It doesn’t allow searching (it just lists mementos for a given URL), and lists results in HTTP `Link` header format at `http://web.archive.org/web/timemap/link/<url>`, e.g. http://web.archive.org/web/timemap/link/https://www.epa.gov/ (There is supposed to be an official JSON format, but I don’t know how to get it from Wayback. `http://web.archive.org/web/timemap/json/<url>` returns timemap data in CDX-json format, which is 🤷)
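To make the three endpoints concrete, here is a minimal sketch that builds a request URL for each one. The function names are hypothetical, and passing the target as a `url` query parameter to the Timemap CDX endpoint is my assumption (it mirrors how the CDX API is called).

```python
from urllib.parse import urlencode

WAYBACK_BASE = "http://web.archive.org"


def cdx_search_url(url):
    """URL for the CDX API (searches a CDX-based index)."""
    return f"{WAYBACK_BASE}/cdx/search/cdx?{urlencode({'url': url})}"


def timemap_cdx_url(url):
    """URL for the "Timemap CDX" API (assumed to take the same `url` parameter)."""
    return f"{WAYBACK_BASE}/web/timemap/cdx?{urlencode({'url': url})}"


def memento_timemap_url(url):
    """URL for the Memento Timemap API in link format.

    The target URL is appended to the path as-is, matching the example above.
    """
    return f"{WAYBACK_BASE}/web/timemap/link/{url}"


epa_timemap = memento_timemap_url("https://www.epa.gov/")
epa_cdx = cdx_search_url("https://www.epa.gov/")
```

Note that only the first two are search APIs; the third just lists mementos for one URL.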
I kind of feel like Timemap may be redundant when you have CDX available (since you can always search CDX for an “exact” [really SURT, not exact] URL match). But it’s possible timemap may be more optimized.
Also, best documentation link I know of is here: https://archive.readme.io/docs
It’s mostly links to other docs, but at least it gets most of the APIs listed. (Not sure how much it’s kept up-to-date, though. 🙁)
Some updates here from recent conversations:
- The old CDX search (`/cdx/search/cdx`) has some real funky issues around `limit` and `showResumeKey` that were major drivers for this new CDX search (`/web/timemap/cdx`). (See #65)
- The new search supports `limit`, but not `showResumeKey`, and doesn’t do weird stuff with `limit`.
- It paginates via `page` + `pageSize` (which are still about blocks; `size` is not referring to a number of results), and is reliable, and includes all the indexes (so it’s up-to-date).
- If you do a non-exact search (`matchType=prefix|host|domain` or you use an `*` in the URL), it does not include the index for recent SavePageNow captures. It takes roughly 3 days for things in that index to make it into other indexes that do support those queries.

So there are still caveats here, but they are simpler to explain and are actually pretty predictable (the out-of-date issue is only a few days, not a few months).

So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals, they probably aren’t great):
- `search_v1()` uses `/cdx/search/cdx` and paginates via `showResumeKey` (i.e. what is currently called `search()`).
- `search_v2()` uses `/web/timemap/cdx` and paginates via `page` + `pageSize` (i.e. the new search).
- `search()` just forwards to one of those implementations.

I’m also thinking we might want to rename `search*()` methods to `listMementos()` or `listCaptures()` or something, since the Internet Archive has an actual free text search of wayback now (e.g. https://web.archive.org/web/*/environment, which is powered by `https://web.archive.org/__wb/search/anchor?q=<text>`, but there are also some endpoints at `https://be-api.us.archive.org/ia-pub-fts-api`, `/services/search/v1/scrape`, and `/advancedsearch.php`, and I don’t know enough about the differences or pros/cons of any of them).
That renaming might be out of scope here, though.
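A minimal sketch of the three-method layout proposed above, just to show the forwarding. Everything here is hypothetical: the class name, the injected `fetch` callable (standing in for the real HTTP and pagination code), and the `default` switch are placeholders, not the package’s actual design.

```python
class CdxSearchSketch:
    """Strawman layout: two concrete searches plus a forwarding search()."""

    def __init__(self, fetch, default="v1"):
        # `fetch` is any callable taking (endpoint_path, params) and returning
        # a list of records. It is injected so this sketch needs no network.
        self._fetch = fetch
        self._default = default

    def search_v1(self, url, **params):
        # Old CDX search; asks for a resume key so results can be continued.
        query = {"url": url, "showResumeKey": "true", **params}
        return self._fetch("/cdx/search/cdx", query)

    def search_v2(self, url, **params):
        # New Timemap CDX search; the caller supplies page/pageSize if needed.
        query = {"url": url, **params}
        return self._fetch("/web/timemap/cdx", query)

    def search(self, url, **params):
        # Forward to whichever implementation is configured as the default.
        impl = self.search_v1 if self._default == "v1" else self.search_v2
        return impl(url, **params)


# Usage with a fake fetch that records what it was asked for.
calls = []


def fake_fetch(endpoint, params):
    calls.append((endpoint, params))
    return []


client = CdxSearchSketch(fake_fetch, default="v2")
client.search("epa.gov", page=0, pageSize=5)
```

Keeping `search()` as a thin forwarder would let the default change later without breaking callers who picked one implementation explicitly.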
Circling back on the naming issue here, my current feeling is that the name should involve `timemap` rather than `v2`. The two have existed alongside each other for a long time now, and it’s no longer clear exactly what the migration or succession path is supposed to be (at one point I was told that the old CDX search at `/cdx/search/cdx` would call into the new implementation at `/web/timemap/cdx` under the hood, but trying the two confirms that they hit different backend servers and behave differently, and it’s been several years).
From a conversation on the Internet Archive’s Research Slack today:
We need to look into whether we should switch to `/web/timemap/cdx`.