Suggestion of an "/archives" endpoint

Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases

Creative Commons Attribution 4.0 International


At the 2021 workshop, we discussed including an OPTIONAL archives entry type and corresponding endpoint in the specification. Below is an incomplete summary of the ideas that were discussed (please feel free to add/edit).

Other promoters: @sauliusg @jacksund

The idea is that this endpoint would serve static snapshots of an entire (as in, all endpoints) OPTIMADE implementation, potentially over subsets of the data (e.g., a particular set of materials).

In terms of format, the snapshot MUST be equivalent to what a client would receive by crawling the OPTIMADE API. It could be represented as a hierarchical filesystem, e.g.:

$ tree dump
dump
└── optimade.example.org
    └── v1
        ├── calculations.json
        ├── info
        │   ├── archives.json
        │   ├── calculations.json
        │   ├── links.json
        │   ├── references.json
        │   └── structures.json
        ├── links.json
        ├── references.json
        └── structures.json

3 directories, 9 files
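As a rough illustration of the layout above, here is a minimal Python sketch that writes already-fetched responses into that directory structure. The function name `write_dump` and its arguments are hypothetical, not part of the specification, and the sketch assumes each endpoint's full response fits in a single JSON file.

```python
import json
from pathlib import Path


def write_dump(root, host, version, entries, info):
    """Write one JSON file per endpoint under root/host/version/.

    `entries` and `info` map endpoint names (e.g. "structures") to
    already-fetched, JSON-serialisable response bodies. All names here
    are illustrative assumptions, not defined by OPTIMADE.
    """
    base = Path(root) / host / version
    (base / "info").mkdir(parents=True, exist_ok=True)
    written = []
    # Entry listing endpoints, e.g. v1/structures.json
    for name, body in entries.items():
        path = base / f"{name}.json"
        path.write_text(json.dumps(body), encoding="utf-8")
        written.append(path)
    # Matching info endpoints, e.g. v1/info/structures.json
    for name, body in info.items():
        path = base / "info" / f"{name}.json"
        path.write_text(json.dumps(body), encoding="utf-8")
        written.append(path)
    # Return paths relative to the dump root, in a stable order
    return sorted(p.relative_to(root).as_posix() for p in written)
```

Calling `write_dump("dump", "optimade.example.org", "v1", entries, info)` with the five entry types shown in the tree would produce the same set of files; how paginated responses should be merged into one file is left open here.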

Potential attributes:

time_stamp/last_modified
checksum
description
version
size
compression_method
url
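
To make the attribute list more concrete, here is a minimal Python sketch of what one archives entry might look like for a file on disk. Nothing here is agreed: the helper name `describe_archive`, the `sha256:` checksum prefix, and the choice of `last_modified` over `time_stamp` are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def describe_archive(path, url, description, version, compression_method):
    """Build a hypothetical `archives` entry for the file at `path`."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in chunks so large archives need not fit in memory
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    stat = path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "id": path.name,
        "type": "archives",
        "attributes": {
            "last_modified": modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "checksum": "sha256:" + digest.hexdigest(),  # prefix is an assumption
            "description": description,
            "version": version,
            "size": stat.st_size,  # bytes
            "compression_method": compression_method,
            "url": url,
        },
    }
```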

Issues discussed

Attribution: references endpoint is naturally included in the dump. Is this enough?
Licensing: do we need to provide a mechanism for licensing databases differently to filtered data? Do we need to worry about this more generally?
ACID: should it be an explicit requirement for serving archives?
Indexing: completely lost in an archive, alongside any context provided by additional endpoints. This may define the natural dividing line between databases that can be archived and those that cannot.
Implementation overhead: may require extra work to support, but for small databases it should be trivial. Should be no requirement on frequency of updates.

Enabling new use cases

For databases that already provide archives, in some format:
- Improved findability, plus standardization of OPTIMADE
- Could remove some database load, e.g. could even replace pagination of “empty” filtering
For smaller databases, easier to archive and easier to deal with for the end user
For non-existent databases, e.g. a dataset on figshare… if provided as an OPTIMADE archive then allows exploration with OPTIMADE clients and hybrid OPTIMADE local clients/servers
Archive-only databases, pointing to persistent long-term storage, indexed in the same way as the providers repository, e.g. a GitHub repo that builds archives.optimade.org with defined prefixes.

Resources

Thanks for the write-up @ml-evs.

For smaller databases, easier to archive and easier to deal with for the end user

We would still need to make converting someone's data to OPTIMADE format easier. Right now, providers would have to read the OPTIMADE spec and convert their structures from whatever format they currently use (POSCAR, CIF, etc.). I think it's worth adding .to_optimade() methods to the pymatgen Structure and/or ase Atoms classes. That way providers can automate conversion, regardless of the initial structure format. These methods would also let OPTIMADE absorb non-standardized datasets easily.

When we go beyond just structures, this can be a lot of work though (e.g. even BandStructure classes would also need a to_optimade method)... This goes with your Implementation overhead bullet point.
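
To give a feel for the structures case, here is a minimal sketch of such a conversion in plain Python. It is not a pymatgen or ase method: `to_optimade` is a hypothetical standalone helper that takes element symbols, lattice vectors, and Cartesian positions, assumes a fully ordered, 3D-periodic structure, and fills in only a subset of the structure properties (for example, `chemical_formula_anonymous` is omitted).

```python
from collections import Counter
from functools import reduce
from math import gcd


def to_optimade(symbols, lattice_vectors, cartesian_positions):
    """Return a partial OPTIMADE `structures` attributes dict.

    Hypothetical helper: assumes one fully occupied element per site
    and periodicity along all three lattice vectors.
    """
    if not symbols:
        raise ValueError("structure has no sites")
    counts = Counter(symbols)
    elements = sorted(counts)  # OPTIMADE wants alphabetical order
    nsites = len(symbols)
    divisor = reduce(gcd, counts.values())
    # Reduced formula: alphabetical elements, count omitted when it is 1
    reduced = "".join(
        el + (str(counts[el] // divisor) if counts[el] // divisor > 1 else "")
        for el in elements
    )
    return {
        "elements": elements,
        "nelements": len(elements),
        "elements_ratios": [counts[el] / nsites for el in elements],
        "chemical_formula_reduced": reduced,
        "nsites": nsites,
        "species_at_sites": list(symbols),
        "species": [
            {"name": el, "chemical_symbols": [el], "concentration": [1.0]}
            for el in elements
        ],
        "lattice_vectors": [[float(x) for x in v] for v in lattice_vectors],
        "cartesian_site_positions": [
            [float(x) for x in pos] for pos in cartesian_positions
        ],
        "dimension_types": [1, 1, 1],
        "nperiodic_dimensions": 3,
        "structure_features": [],
    }
```

A real method on Structure or Atoms would mostly be a thin wrapper that pulls these three inputs out of the object; disordered sites, partial periodicity, and other entry types are where the extra work mentioned above comes in.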

Attribution & Licensing

This is probably the biggest roadblock to archives. Making it optional should make things a lot easier though. I'd anticipate that the larger and better known a database is, the less its maintainers will want to participate in this endpoint.

Also, what if we add license to your list of attributes? That way, specific licensing could be attached to each individual archive dump.

The url attribute could also (optionally) be provider-controlled, e.g. a CDN with authentication or a link to their own website. This would leave download stats in their hands.

Another route is collecting usage statistics that can be sent back to providers (for them to use in future grant proposals). Users would have to agree to such data collection if they want to download an archive. I'm personally against data collection, but it might be a necessary compromise for some providers to participate. This would have to be implemented in the OPTIMADE client package too.

Could remove some database load

One potential issue is that the OPTIMADE spec doesn't aim to be a condensed format. Instead it aims to be robust, encompassing, and flexible. So we could actually end up with dump files that are larger than the ones providers make themselves. For example, I was able to get all MP structures into a dump file below 100 MB -- but I don't think I can get anywhere close to that value using the OPTIMADE spec and JSON format.
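
One way to check how bad this is for a given dataset would be to measure it directly. Below is a minimal Python sketch (the helper name `dump_sizes` is hypothetical) that serialises a list of entries to compact JSON and reports the raw and gzip-compressed sizes. It says nothing about the MP numbers above; it only shows how the comparison could be made.

```python
import gzip
import json


def dump_sizes(entries):
    """Return the byte size of `entries` as compact JSON, raw and gzipped."""
    # separators=(",", ":") drops the whitespace json.dumps adds by default
    raw = json.dumps(entries, separators=(",", ":")).encode("utf-8")
    return {"json_bytes": len(raw), "gzip_bytes": len(gzip.compress(raw))}
```

Since OPTIMADE responses repeat the same keys for every entry, general-purpose compression should recover a fair amount of the overhead, which is presumably what the compression_method attribute is for.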
