Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Suggestion of an "/archives" endpoint #364

Open ml-evs opened 3 years ago

ml-evs commented 3 years ago

At the 2021 workshop, we discussed including an OPTIONAL archives entry type and corresponding endpoint in the specification. Below is an incomplete summary of the ideas that were discussed (please feel free to add/edit).

Other promoters: @sauliusg @jacksund

The idea is that this endpoint would serve static snapshots of an entire (as in, all endpoints) OPTIMADE implementation, potentially over subsets of the data (e.g., a particular set of materials).

This MUST be equivalent to what would be received by crawling an OPTIMADE API (in terms of format), and this could be represented as a hierarchical filesystem, e.g.

$ tree dump
dump
└── optimade.example.org
    └── v1
        ├── calculations.json
        ├── info
        │   ├── archives.json
        │   ├── calculations.json
        │   ├── links.json
        │   ├── references.json
        │   └── structures.json
        ├── links.json
        ├── references.json
        └── structures.json

3 directories, 9 files
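
As a rough illustration of the layout above, here is a minimal sketch of how crawled responses could be written out as one JSON file per endpoint. The function name `write_dump` and the placeholder payloads are assumptions made for this example; nothing here is defined by the specification.

```python
import json
import tempfile
from pathlib import Path


def write_dump(root, host, version, responses):
    """Write each endpoint response to <root>/<host>/<version>/<endpoint>.json.

    `responses` maps an endpoint path (e.g. "info/structures") to the
    JSON-serialisable body that a crawler received for it.
    """
    base = Path(root) / host / version
    written = []
    for endpoint, payload in sorted(responses.items()):
        target = base / f"{endpoint}.json"
        # Create nested directories such as "info/" on demand.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload), encoding="utf-8")
        written.append(target.relative_to(root).as_posix())
    return written


# Placeholder payloads; a real dump would hold the full crawled responses.
responses = {
    "structures": {"data": []},
    "links": {"data": []},
    "info/structures": {"data": {}},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_dump(Path(tmp), "optimade.example.org", "v1", responses)
    for path in paths:
        print(path)
```

Keeping the directory structure identical to the URL structure means a client could, in principle, resolve an endpoint path against either a live server or an unpacked archive.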

Potential attributes:

Issues discussed

Enabling new use cases

Resources

jacksund commented 3 years ago

Thanks for the write-up @ml-evs.

For smaller databases, easier to archive and easier to deal with for the end user

We would still need to make it easier to convert someone's data to the OPTIMADE format. Right now, providers would have to read the OPTIMADE spec and convert their structures from whatever format they currently use (POSCAR, CIF, etc.). I think it's worth adding .to_optimade() methods to the pymatgen Structure and/or ase Atoms classes. That way providers can automate the conversion, regardless of the initial structure format. These methods would also make it easy for OPTIMADE to absorb non-standardized datasets.

Once we go beyond structures, though, this can be a lot of work (e.g. BandStructure classes would also need a to_optimade method)... This ties in with your Implementation overhead bullet point.
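
To make the structure-conversion idea concrete, here is a minimal sketch of what such a helper might produce. It works on plain Python lists rather than pymatgen or ase objects, so the function `to_optimade` and its signature are hypothetical and not an existing method of either library. It assumes a fully ordered, 3D-periodic structure and fills in only a subset of the structure fields from the OPTIMADE spec (for example, it omits `last_modified` and `structure_features`).

```python
from collections import Counter
from functools import reduce
from math import gcd


def to_optimade(entry_id, lattice, symbols, positions):
    """Build an OPTIMADE-style "structures" entry from plain Python data.

    lattice:   three lattice vectors, in angstrom
    symbols:   one chemical symbol per site
    positions: one Cartesian coordinate triple per site, in angstrom
    """
    counts = Counter(symbols)
    elements = sorted(counts)  # alphabetical order of element symbols
    divisor = reduce(gcd, counts.values())

    # Reduced formula: alphabetical elements, a count of 1 is left out.
    parts = []
    for element in elements:
        n = counts[element] // divisor
        parts.append(element if n == 1 else f"{element}{n}")

    return {
        "id": entry_id,
        "type": "structures",
        "attributes": {
            "elements": elements,
            "nelements": len(elements),
            "chemical_formula_reduced": "".join(parts),
            "dimension_types": [1, 1, 1],  # assumption: periodic in 3D
            "nperiodic_dimensions": 3,
            "lattice_vectors": [list(v) for v in lattice],
            "cartesian_site_positions": [list(p) for p in positions],
            "nsites": len(symbols),
            "species_at_sites": list(symbols),
            # Assumption: every species is a single, fully occupied element.
            "species": [
                {"name": el, "chemical_symbols": [el], "concentration": [1.0]}
                for el in elements
            ],
        },
    }


# Two-site rock-salt cell used purely as example input.
entry = to_optimade(
    "example-1",
    lattice=[[0.0, 2.82, 2.82], [2.82, 0.0, 2.82], [2.82, 2.82, 0.0]],
    symbols=["Na", "Cl"],
    positions=[[0.0, 0.0, 0.0], [2.82, 2.82, 2.82]],
)
```

A real pymatgen or ase method would additionally have to handle disorder, partial occupancies and non-periodic directions, which is where most of the effort would likely go.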

Attribution & Licensing

This is probably the biggest roadblock to archives. Making it optional should make things a lot easier though. I'd anticipate the larger and more well-known a database is, the less they'll want to participate in this endpoint.

Also, what if we add a license attribute to your list of attributes? Each individual archive dump would then carry its own license.

The url attribute could also (optionally) be provider-controlled: a CDN with authentication, a link to their own website, etc. This would leave download statistics in the provider's hands.
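
As a rough sketch of how the two attributes mentioned here could sit on an archive entry: the attribute names `url` and `license` come from this discussion, but the overall entry layout, the helper `make_archive_entry` and the example URL are assumptions, since nothing about this entry type has been specified yet.

```python
import json


def make_archive_entry(entry_id, url, license_url):
    """Return a hypothetical "archives" entry in the usual id/type/attributes shape."""
    return {
        "id": entry_id,
        "type": "archives",
        "attributes": {
            # Provider-controlled download location (could sit behind authentication).
            "url": url,
            # License that applies to this particular dump.
            "license": license_url,
        },
    }


archive = make_archive_entry(
    "structures-snapshot-1",  # hypothetical identifier
    "https://optimade.example.org/downloads/structures-snapshot-1.tar.gz",
    "https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(archive, indent=2))
```

Because the license lives on the entry rather than on the provider, two dumps from the same database could be released under different terms.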

Another route is to collect usage statistics that can be sent back to providers (for them to use in future grant proposals). Users would have to agree to this data collection in order to download an archive. I'm personally against data collection, but it might be a necessary compromise for some providers to participate. It would also have to be implemented in the OPTIMADE client package.

Could remove some database load

One potential issue is that the OPTIMADE spec doesn't aim to be a condensed format. Instead, it aims to be robust, encompassing and flexible. So we could actually end up with dump files that are larger than the ones providers make themselves. For example, I was able to get all MP structures into a dump file below 100 MB, but I don't think I can get anywhere close to that using the OPTIMADE spec and the JSON format.
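
To give a feel for the overhead, here is a small sketch that serialises the same site positions once as an OPTIMADE-style JSON entry and once as raw 32-bit floats. The data are made up and the comparison is deliberately crude: the packed form drops species labels and all metadata, and it says nothing about how the 100 MB dump mentioned above was produced.

```python
import json
import struct

nsites = 8
# Made-up Cartesian positions, one triple per site.
positions = [[0.123456 * i, 0.234567 * i, 0.345678 * i] for i in range(nsites)]

entry = {
    "id": "example-1",
    "type": "structures",
    "attributes": {
        "nsites": nsites,
        "species_at_sites": ["Si"] * nsites,
        "cartesian_site_positions": positions,
    },
}

as_json = json.dumps(entry).encode("utf-8")

# Compact alternative: little-endian float32 values, no keys, no labels.
flat = [x for triple in positions for x in triple]
packed = struct.pack(f"<{len(flat)}f", *flat)

sizes = {"json": len(as_json), "packed_float32": len(packed)}
print(sizes)
```

Generic compression of the JSON files would narrow the gap to some extent, so the practical difference depends on how archives end up being packaged.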