ml-evs opened this issue 3 years ago
Thanks for the write-up @ml-evs.
# For smaller databases, easier to archive and easier to deal with for the end user
We would still need to make converting someone's data to OPTIMADE format easier. Right now, it would require providers to read the OPTIMADE spec and convert from their current structure formats (POSCAR, CIF, etc.). I think it's worth adding `.to_optimade()` methods for the pymatgen `Structure` and/or ASE `Atoms` classes. That way providers can automate the conversion regardless of the initial structure format, and these methods would also let OPTIMADE absorb non-standardized datasets more easily.
When we go beyond just structures, though, this becomes a lot of work (e.g. even `BandStructure` classes would need a `to_optimade` method)... This ties in with your *Implementation overhead* bullet point.
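For the structure case, a minimal sketch of what such a helper could look like, assuming pymatgen is installed (the function name and the subset of OPTIMADE attributes it fills in are illustrative, not an existing pymatgen API):

```python
from pymatgen.core import Lattice, Structure


def to_optimade(structure: Structure) -> dict:
    """Sketch: map an ordered pymatgen Structure onto a few OPTIMADE structure attributes."""
    elements = sorted(el.symbol for el in structure.composition.elements)
    return {
        "lattice_vectors": structure.lattice.matrix.tolist(),
        "cartesian_site_positions": structure.cart_coords.tolist(),
        "species_at_sites": [site.specie.symbol for site in structure],
        "species": [
            {"name": el, "chemical_symbols": [el], "concentration": [1.0]}
            for el in elements
        ],
        "elements": elements,
        "nelements": len(elements),
        "nsites": len(structure),
        "dimension_types": [1, 1, 1],
        "nperiodic_dimensions": 3,
    }


# Example: conventional rock-salt NaCl cell
nacl = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.692), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)
print(to_optimade(nacl)["elements"])  # ['Cl', 'Na']
```

An analogous method on ASE `Atoms` would follow the same pattern, and the returned dictionary could then be dropped into the `attributes` field of a structures entry.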
# Attribution & Licensing
This is probably the biggest roadblock to archives, though making it optional should make things a lot easier. I'd anticipate that the larger and better-known a database is, the less it will want to participate in this endpoint.
Also, what if we add `license` to your list of attributes? That way a unique license would be attached to each individual archive dump.
The `url` attribute could also (optionally) be provider-controlled: a CDN with authentication, a link to their own website, etc. This would leave download statistics in their hands.
Another route is collecting usage statistics that can be sent back to providers (for them to use in future grant proposals). Users would have to agree to such data collection if they want to download an archive. I'm personally against data collection, but it might be a necessary compromise to get some providers to participate. This would also have to be implemented in the OPTIMADE client package.
# Could remove some database load
One potential issue is that the OPTIMADE spec doesn't aim to be a condensed format; instead it shoots for being robust, encompassing, and flexible. So we could actually end up with dump files that are larger than the ones providers make themselves. For example, I was able to get all MP structures into a dump file below 100 MB -- but I don't think I can get anywhere close to that using the OPTIMADE spec and JSON format.
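As a rough way to put numbers on that, one could serialize a single OPTIMADE-style entry, compare its raw and compressed sizes, and extrapolate to the whole database (a toy sketch; real entries carry many more attributes, so the numbers will vary):

```python
import gzip
import json

# Toy OPTIMADE-style structures entry; real entries include many more attributes.
entry = {
    "id": "example-1",
    "type": "structures",
    "attributes": {
        "elements": ["Cl", "Na"],
        "nelements": 2,
        "nsites": 2,
        "species_at_sites": ["Na", "Cl"],
        "cartesian_site_positions": [[0.0, 0.0, 0.0], [2.846, 2.846, 2.846]],
        "lattice_vectors": [[5.692, 0.0, 0.0], [0.0, 5.692, 0.0], [0.0, 0.0, 5.692]],
        "last_modified": "2021-12-01T00:00:00Z",
    },
}

raw = json.dumps(entry).encode()
print(f"raw: {len(raw)} B, gzipped: {len(gzip.compress(raw))} B")
```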
At the 2021 workshop, we discussed including an OPTIONAL `archives` entry type and corresponding endpoint in the specification. Below is an incomplete summary of the ideas that were discussed (please feel free to add/edit).

Other promoters: @sauliusg @jacksund
The idea is that this endpoint would serve static snapshots of an entire OPTIMADE implementation (as in, all endpoints), potentially also of subsets of the data (e.g., a particular set of materials).
These snapshots MUST be equivalent, in terms of format, to what would be received by crawling the OPTIMADE API, and could be represented as a hierarchical filesystem, e.g. with one directory per endpoint.
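A rough sketch of how such a snapshot could be produced, assuming the `requests` package and an illustrative base URL (error handling and rate limiting omitted):

```python
import json
import pathlib

import requests

BASE_URL = "https://example.org/optimade/v1"  # illustrative provider URL
ARCHIVE_ROOT = pathlib.Path("archive")


def dump_endpoint(endpoint: str) -> None:
    """Crawl one OPTIMADE endpoint page by page and mirror the responses on disk."""
    out_dir = ARCHIVE_ROOT / endpoint
    out_dir.mkdir(parents=True, exist_ok=True)
    url = f"{BASE_URL}/{endpoint}"
    page = 0
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()
        (out_dir / f"page-{page}.json").write_text(json.dumps(data))
        # Paginated OPTIMADE responses advertise the next page under links -> next
        next_link = (data.get("links") or {}).get("next")
        url = next_link.get("href") if isinstance(next_link, dict) else next_link
        page += 1


for endpoint in ("info", "structures", "references", "links"):
    dump_endpoint(endpoint)
```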
Potential attributes:
- `time_stamp`/`last_modified`
- `checksum`
- `description`
- `version`
- `size`
- `compression_method`
- `url`
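For concreteness, a hypothetical entry served from an `/archives` endpoint, using the attributes above plus the optional `license` field suggested earlier (all values are made up):

```python
# Hypothetical /archives entry, shown as a Python dict purely for illustration.
example_archive_entry = {
    "id": "snapshot-2021-12-01",
    "type": "archives",
    "attributes": {
        "last_modified": "2021-12-01T00:00:00Z",  # or time_stamp
        "checksum": "sha256:...",                 # placeholder digest of the archive file
        "description": "Full snapshot of all endpoints as of 2021-12-01",
        "version": "1.0.0",
        "size": 104857600,                        # bytes
        "compression_method": "gzip",
        "url": "https://example.org/archives/snapshot-2021-12-01.tar.gz",
        "license": "CC-BY-4.0",                   # optional, per the licensing discussion above
    },
}
```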
# Issues discussed

# Enabling new use cases

# Resources