datatogether / coverage

Project for visualizing the status of digital data archiving efforts across various data repositories
http://api.archivers.co/coverage
GNU Affero General Public License v3.0

Transform coverage repo into a backend service that feeds the archivers api #4

b5 closed 7 years ago

b5 commented 7 years ago

While working on visualization, @blackglade made the excellent point that measuring coverage by separating urls into trees may not be enough to convey coverage of a topic area. The issue is here: https://github.com/edgi-govdata-archiving/visualization-experiments/issues/1

I brought up primers (guides to topic areas built by volunteers), which seem to fit the Agency/Group model @blackglade is referencing. @titaniumbones recalled that archivers 2.0 supports a machine-readable form of primers, which we would need in order to build such a model. So it seems we need to play connect-the-dots between coverage calculation (this repo) and archivers 2.0 primers. The best place to do this seems to be the archivers 2.0 api, where info about primers would be on hand.

The outcome would be a new set of api endpoints that look like /primers/{primer-id}/coverage, and that would hand back a JSON coverage tree very similar to the example one in this repo, except we would calculate one tree for each primer. We'd accomplish this using the same pattern matching techniques that sources use to traverse massive lists of urls looking for ones that apply.
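To make the per-primer idea concrete, here's a rough sketch of the calculation in Python. It is only an illustration under stated assumptions: the field names (`total`, `archived`, `children`), the function names, and the use of plain URL-prefix matching are all hypothetical and not taken from this repo or the archivers api. The real tree format and matching rules may differ.

```python
from urllib.parse import urlparse


def matches_primer(url, patterns):
    """Return True if the url starts with any of the primer's prefix patterns.

    Prefix matching is an assumption made for this sketch.
    """
    return any(url.startswith(p) for p in patterns)


def new_node():
    # Hypothetical node shape: counts plus child nodes keyed by path segment.
    return {"total": 0, "archived": 0, "children": {}}


def build_coverage_tree(urls, archived_urls, patterns):
    """Build one coverage tree for a single primer.

    urls: every known url
    archived_urls: set of urls that have been archived
    patterns: url prefixes that belong to the primer
    """
    root = new_node()
    for url in urls:
        if not matches_primer(url, patterns):
            continue
        parsed = urlparse(url)
        # Host first, then each non-empty path segment.
        parts = [parsed.netloc] + [p for p in parsed.path.split("/") if p]
        is_archived = url in archived_urls

        node = root
        node["total"] += 1
        node["archived"] += int(is_archived)
        for part in parts:
            node = node["children"].setdefault(part, new_node())
            node["total"] += 1
            node["archived"] += int(is_archived)
    return root


if __name__ == "__main__":
    all_urls = [
        "https://epa.gov/data/a.csv",
        "https://epa.gov/data/b.html",
        "https://noaa.gov/x.pdf",
    ]
    tree = build_coverage_tree(
        all_urls,
        archived_urls={"https://epa.gov/data/a.csv"},
        patterns=["https://epa.gov/"],
    )
    print(tree["total"], tree["archived"])
```

Each node carries the count of matching urls beneath it and how many of those are archived, so a visualization could read coverage at any depth without recomputing it.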

Once that's in place the next step would be to add a content-only query param that would isolate content-urls away from html-urls, giving us a clear list of "files".
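As a rough sketch of what the content-only filter could do, the Python below separates content urls from html urls by file extension. This is an assumption made purely for illustration: the extension list and the `is_content_url` and `filter_urls` names are hypothetical, and a real implementation might rely on something else entirely, such as the response's content type.

```python
import posixpath
from urllib.parse import urlparse

# Assumption for this sketch: urls with no extension, or an html-like
# extension, are treated as html pages; everything else is "content".
HTML_EXTENSIONS = {"", ".html", ".htm"}


def is_content_url(url):
    """Guess from the path's extension whether a url points at a file."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext not in HTML_EXTENSIONS


def filter_urls(urls, content_only=False):
    """Return all urls, or only the content urls when content_only is set."""
    if not content_only:
        return list(urls)
    return [u for u in urls if is_content_url(u)]


if __name__ == "__main__":
    sample = [
        "https://epa.gov/data/a.csv",
        "https://epa.gov/data/b.html",
        "https://epa.gov/data/",
    ]
    print(filter_urls(sample, content_only=True))
```

Applying a filter like this to the url list before building a primer's tree would leave a tree containing only the "files".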

This repo would still perform its job of listing & coordinating with outside services, but it would be exposed to the world via the API repo, which would have cleaner documentation on how to work with coverage.

I'm going to start experimenting with this approach now, as I don't want to leave @blackglade blocked for long. If we'd like to discuss specific API implementation issues, let's do that here. Otherwise it's best to assume this api is in the works, and to discuss how we're going to use it over on the main issue: https://github.com/edgi-govdata-archiving/visualization-experiments/issues/1