Implement CDX search based on newer `timemap` CDX API

Mr0grog commented 6 years ago

From a conversation on the Internet Archive’s Research Slack today:

kenji Igor http://spacex.com/robots.txt has Disallow: /includes/ and http://web.archive.org/cdx/search still honors robots.txt exclusion (because it’s served by older wayback machine), while playback ignores robots.txt (served by new wayback machine).

http://web.archive.org/web/timemap/cdx?url=www.spacex.com&matchType=domain&gzip=false&filter=statuscode:200&to=20041229235959 will give you more results, including those under the /includes/ path. /web/timemap/cdx is served by new wayback.

I’m sorry for the confusing, inconsistent results - we’re trying to migrate all services to new wayback

oh btw, a tip: to=2004 will be interpreted as 20041231235959 (in case you’re not excluding days 30 and 31 on purpose :smile:)

Igor kenji Thank you!

mr0grog Oh, I did not know about /web/timemap/cdx as opposed to just /cdx/search/cdx. Should I be using the former instead of the latter?

kenji /web/timemap/cdx is better functionality-wise, but it’s slower than /cdx/search. So I’d suggest /cdx/search as long as it works ok for your purpose.

mr0grog Ah, ok. Will need to consider which is the right path. Is there anything that documents the functional differences? e.g. the robots.txt issue would be a hard one to discover.

Do you have a rough sense of how much slower /web/timemap/cdx is?

kenji I don’t have a good benchmark result (it would be nice to have one), but I find /web/timemap/cdx 10-20% slower for a matchType=exact query. matchType=domain can be much slower.

We need to look into whether we should switch to /web/timemap/cdx.
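To make the comparison concrete, here is a minimal sketch that builds the same query against both endpoints so the results can be diffed. The endpoint paths and query parameters come from the conversation above; `build_cdx_url` is a hypothetical helper, not part of any existing package, and it only constructs URLs (no network requests are made).

```python
from urllib.parse import urlencode

# Endpoint URLs as quoted in the conversation above.
CDX_SEARCH_URL = "http://web.archive.org/cdx/search/cdx"
TIMEMAP_CDX_URL = "http://web.archive.org/web/timemap/cdx"


def build_cdx_url(base_url, url, **params):
    """Build a query URL for either endpoint (hypothetical helper)."""
    query = {"url": url}
    query.update(params)
    # urlencode percent-encodes values, e.g. "statuscode:200" -> "statuscode%3A200".
    return f"{base_url}?{urlencode(query)}"


# The same parameters as the example query in the chat, sent to each endpoint.
params = {
    "matchType": "domain",
    "gzip": "false",
    "filter": "statuscode:200",
    "to": "20041231235959",
}
old_query = build_cdx_url(CDX_SEARCH_URL, "www.spacex.com", **params)
new_query = build_cdx_url(TIMEMAP_CDX_URL, "www.spacex.com", **params)
```

Fetching both URLs and comparing the returned rows would be one way to surface differences like the robots.txt behavior described above.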

Mr0grog commented 6 years ago

Other notes I have discovered in edgi-govdata-archiving/web-monitoring-processing#174: this new API doesn’t support resumeKey; you have to use page and pageSize for iterating through results (which is not as straightforward as you might think).

Mr0grog commented 5 years ago

Update: since the above conversation happened, Wayback folks have started gently pushing us to more actively use the newer services, like timemap CDX and SPN2. So I think the answer to this issue is probably “yes we should” now.

Mr0grog commented 5 years ago

Since this is still beta-ish, we should probably implement this alongside the old /cdx/search API.

Mr0grog commented 5 years ago

I’ve been holding off on this since @danielballan is in the middle of splitting off this code into http://github.com/edgi-govdata-archiving/wayback. It should be done, but in that new repo whenever it’s ready.

danielballan commented 5 years ago

Note to selves: once this is closed, it might be kind to state in the release notes how to migrate wayback v0.1 code to whatever API we settle on for timemap, if doing so is not too much trouble.

Mr0grog commented 5 years ago

FWIW, I think the API (from a user of this package’s perspective) would be the same. The Timemap CDX API (which, to be clear, is not the timemap API, which is a whole other thing!):

- Returns data in the same format as the CDX API, but has some extra fields on the end that aren’t generally useful unless you have access to internal archive.org services (supposedly these will be removed from the public API at some point).
- Does paging differently, but we don’t expose access to the paging in our Python API anyway, so this should mostly be an implementation detail that is largely invisible to a user. (In the current CDX API, you can paginate via resumeKey or via actual page size & number, but the latter will not give you recent data. In the new Timemap CDX API, there is no resumeKey and you must use page size & number, but it should include up-to-date data.)
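Since the extra trailing fields may come and go, a parser probably should not depend on them. The following is a minimal sketch, assuming space-separated rows whose first six fields are urlkey, timestamp, original, mimetype, statuscode, and digest; that field order is an assumption here, and `CdxRecord`, `parse_cdx_line`, and the sample row are made up for illustration.

```python
from typing import NamedTuple, Tuple


class CdxRecord(NamedTuple):
    # Assumed core field order; anything after these is kept but not relied on.
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    extra: Tuple[str, ...]


def parse_cdx_line(line):
    """Parse one space-separated CDX row, tolerating extra trailing fields."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"Expected at least 6 fields, got {len(fields)}")
    return CdxRecord(*fields[:6], extra=tuple(fields[6:]))


# Illustrative row only; the values are invented, not real archive data.
sample = "gov,epa)/ 20200101000000 https://www.epa.gov/ text/html 200 ABCDEFG 1234 5678 example.warc.gz"
record = parse_cdx_line(sample)
```

Keeping the trailing values in `extra` means the same parser works whether or not the endpoint includes them.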

danielballan commented 5 years ago

Ah, I was conflating the Timemap CDX API with the timemap API. I have half-absorbed the fact that they are different things, but I got confused here. Which one did wayback v0.1 implement?

Mr0grog commented 5 years ago

Wayback v0.1 implemented the Timemap API (not Timemap CDX, which isn’t really its name, but it doesn’t have one, and ¯\_(ツ)_/¯).

If helpful (since Wayback APIs are a half-documented, scattered situation):

The CDX API, which lets you search through a CDX-based index (and returns a subset of fields from each matching CDX record), is at http://web.archive.org/cdx/search/cdx

The “Timemap CDX” API, which is the same thing but uses different code and (I think?) a separate CDX index, is at http://web.archive.org/web/timemap/cdx

(I call it “Timemap CDX” because of the URL. I have also heard “new CDX,” “beta CDX,” “CDX v2,” etc.)

The Timemap API is part of the Memento protocol (guide, RFC, Wayback-specific “docs”) which is a semi-standard agreed to by lots of archives. It doesn’t allow searching (it just lists mementos for a given URL), and lists results in HTTP Link header format at http://web.archive.org/web/timemap/link/<url>, e.g. http://web.archive.org/web/timemap/link/https://www.epa.gov/

(There is supposed to be an official JSON format, but I don’t know how to get it from Wayback. http://web.archive.org/web/timemap/json/<url> returns timemap data in CDX-json format, which is 🤷‍♀)
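For contrast with the CDX row format, here is a rough sketch of reading the link-format output of the Timemap API. It assumes each entry looks like `<url>; key="value"; ...` with quoted attribute values, and the sample text (including its datetime) is invented for illustration; `parse_link_format` and `list_mementos` are hypothetical names.

```python
import re

# One entry: <url> followed by zero or more ; key="value" attributes.
ENTRY_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_format(text):
    """Parse link-format text into a list of dicts (quoted attributes only)."""
    entries = []
    for match in ENTRY_PATTERN.finditer(text):
        attrs = dict(ATTR_PATTERN.findall(match.group(2)))
        entries.append({"url": match.group(1), **attrs})
    return entries


def list_mementos(text):
    """Keep only entries whose rel value includes the word 'memento'."""
    return [
        entry for entry in parse_link_format(text)
        if "memento" in entry.get("rel", "").split()
    ]


# Invented sample in the assumed shape; not real archive output.
SAMPLE = (
    '<https://www.epa.gov/>; rel="original",\n'
    '<http://web.archive.org/web/19970418120600/http://www.epa.gov:80/>; '
    'rel="first memento"; datetime="Fri, 18 Apr 1997 12:06:00 GMT",\n'
)
mementos = list_mementos(SAMPLE)
```

The regex matches whole entries rather than splitting on commas, because datetime values themselves contain commas.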

Mr0grog commented 5 years ago

I kind of feel like Timemap may be redundant when you have CDX available (since you can always search CDX for an “exact” [really SURT, not exact] URL match). But it’s possible timemap may be more optimized.

Mr0grog commented 5 years ago

Also, best documentation link I know of is here: https://archive.readme.io/docs

It’s mostly links to other docs, but at least it gets most of the APIs listed. (Not sure how much it’s kept up-to-date, though. 🙁)

Mr0grog commented 2 years ago

Some updates here from recent conversations:

- The old CDX search (/cdx/search/cdx) has some real funky issues around limit and showResumeKey that were major drivers for this new CDX search (/web/timemap/cdx). (See #65)
- The new search supports limit, but not showResumeKey, and doesn’t do weird stuff with limit.
- The new search only paginates with page + pageSize (which are still about blocks; size is not referring to a number of results), and is reliable, and includes all the indexes (so it’s up-to-date).
- BUT if you use a non-exact search (i.e. matchType=prefix|host|domain or you use an * in the URL), it does not include the index for recent SavePageNow captures. It takes roughly 3 days for things in that index to make it into other indexes that do support those queries. So there are still caveats here, but they are simpler to explain and are actually pretty predictable (the out-of-date issue is only a few days, not a few months).
- archive.org is doing a slow transition to the new search, using it for some things under the hood to test it out.
- Eventually (no concrete timeline yet) the old search will be replaced with the new one.
- The new search includes extra fields (length, offset, WARC filename) that they expect to remove when replacing the old search, so we should not expect them to always be present.
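Because pages here correspond to index blocks rather than a number of results, a loop probably should not treat a short or empty page as the end of the data. A minimal sketch of that idea follows; `iter_pages` and `fetch_page` are hypothetical, pages are assumed to be numbered from 0, and the total page count is assumed to be known by the caller (how to obtain it from the endpoint is not covered in this thread).

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[str]], num_pages: int) -> Iterator[str]:
    """Yield rows from every page, without stopping early on empty pages.

    fetch_page(page) is a hypothetical callable returning the rows for one
    page; in real code it would request the endpoint with page + pageSize.
    """
    for page in range(num_pages):
        # A page maps to index blocks, so it may hold few or no matching rows.
        yield from fetch_page(page)


# Fake data standing in for network responses; page 1 is deliberately empty.
fake_pages = {0: ["row-a"], 1: [], 2: ["row-b", "row-c"]}
rows = list(iter_pages(lambda page: fake_pages[page], num_pages=3))
```

Injecting `fetch_page` keeps the paging logic testable without touching the network.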

So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals, they probably aren’t great):

- search_v1() uses /cdx/search/cdx and paginates via showResumeKey (i.e. what is currently called search()).
- search_v2() uses /web/timemap/cdx and paginates via page + pageSize (i.e. the new search).
- search() just forwards to one of those implementations.
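Roughly, the three strawman methods could fit together like this. It is only a sketch of the forwarding shape: `CdxClient` and `default_version` are hypothetical names, and the two implementations are stubs that describe the request they would make rather than performing any network I/O.

```python
class CdxClient:
    """Sketch of the strawman API; not the package's actual class."""

    V1_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
    V2_ENDPOINT = "http://web.archive.org/web/timemap/cdx"

    def __init__(self, default_version="v1"):
        if default_version not in ("v1", "v2"):
            raise ValueError(f"Unknown version: {default_version!r}")
        self.default_version = default_version

    def search_v1(self, url, **params):
        # Stub: real code would paginate via showResumeKey.
        return {"endpoint": self.V1_ENDPOINT, "url": url, "params": params}

    def search_v2(self, url, **params):
        # Stub: real code would paginate via page + pageSize.
        return {"endpoint": self.V2_ENDPOINT, "url": url, "params": params}

    def search(self, url, **params):
        # Forward to whichever implementation is configured as the default.
        impl = self.search_v1 if self.default_version == "v1" else self.search_v2
        return impl(url, **params)


result = CdxClient(default_version="v2").search("www.epa.gov", matchType="exact")
```

With this shape, switching the default later would not change what callers of `search()` write.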

I’m also thinking we might want to rename the search*() methods to listMementos() or listCaptures() or something, since the Internet Archive now has an actual free-text search of Wayback (e.g. https://web.archive.org/web/*/environment, which is powered by https://web.archive.org/__wb/search/anchor?q=<text>; there are also endpoints at https://be-api.us.archive.org/ia-pub-fts-api, /services/search/v1/scrape, and /advancedsearch.php, but I don’t know enough about their differences or pros/cons).

That renaming might be out of scope here, though.

Mr0grog commented 11 months ago

Circling back on the naming issue here, my current feeling is that the name should involve timemap rather than v2. The two have existed alongside each other for a long time now, and it’s no longer clear exactly what the migration or succession path is supposed to be (at one point I was told that the old CDX search at /cdx/search/cdx would call into the new implementation at /web/timemap/cdx under the hood, but trying the two confirms that they hit different backend servers and behave differently, and it’s been several years).

edgi-govdata-archiving / wayback

Implement CDX search based on newer `timemap` CDX API #8