commoncrawl / cc-index-server

Common Crawl Index Server
http://index.commoncrawl.org/

Retrieve link for latest available index #2

Closed RootLUG closed 7 years ago

RootLUG commented 7 years ago

Feature request, as per the discussion at https://groups.google.com/forum/#!topic/common-crawl/PZ10uQCRk-E: add a machine-friendly API call/endpoint for retrieving the latest available CC index.

RootLUG commented 7 years ago

From a technical perspective, the problem of the index being switched while a client is paginating could be solved in one of two ways. If "latest" is implemented through some URL aliasing/redirect, the client could send an index name (returned by the first pagination API request) together with each page request; this is more complicated and might break something. Alternatively, it is not a problem at all if there is an endpoint that provides a list of indexes with timestamps: the client selects the latest one and sends its API requests there as usual. That should be the easiest way and would not break any compatibility.

sebastian-nagel commented 7 years ago

After a look into pywb: there is already a handler which lists all available indexes. It's now enabled (http://test-index.commoncrawl.org/collinfo.json). The indexes are sorted from new to old: pick the first one to get the latest crawl/index.
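
A minimal client-side sketch of "pick the first one", assuming the endpoint returns a JSON array ordered from new to old as described above. The function names `pick_latest` and `fetch_latest_index` are hypothetical, and the fields of each entry are not spelled out in this thread, so the code returns the first entry as-is rather than reading a specific key.

```python
import json
from urllib.request import urlopen

# URL taken from the comment above; the response is assumed to be a JSON array.
COLLINFO_URL = "http://test-index.commoncrawl.org/collinfo.json"


def pick_latest(collections):
    """Return the first entry, relying on the newest-first ordering."""
    if not collections:
        raise ValueError("index list is empty")
    return collections[0]


def fetch_latest_index(url=COLLINFO_URL, timeout=30):
    """Download the index list and return the entry for the latest index."""
    with urlopen(url, timeout=timeout) as resp:
        collections = json.load(resp)
    return pick_latest(collections)
```

A client would call `fetch_latest_index()` once, then send all of its paginated requests to that index, so a newer index appearing mid-run does not affect the page sequence.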