Standardize how to specify materials properties for structures

The verbal discussion about this issue also included how to handle the same structure at multiple different levels of accuracy in the same database (as present in e.g. OQMD, matador). To provide a concrete example, if you wished to construct the Li-P binary phase diagram from our database, querying for Li-P phases would provide you with 10,000 geometry optimisations providing formation energies converged to 20-50 meV/atom, and 100 converged to 1 meV/atom. When querying across all databases, for the most common use case we would want to return our "best set" of calculations.

Do we want to specify a way for database providers to indicate that one set of calculations is "preferred"? Perhaps just as an optional field that by default is queried for True/null? This could also apply to e.g. "preferring" GW band gap calculations of a material over GGA gaps in the same database. This would require very careful wording in the spec to not be vague, but could be useful. I'm happy to open another issue if this is too off-topic.
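To make the "queried for True/null by default" idea concrete, here is a minimal Python sketch. The field name `preferred` and the sample entries are hypothetical and only illustrate the suggested default; nothing here is part of the spec.

```python
# Hypothetical entries; `preferred` is an assumed optional field, not part of the spec.
entries = [
    {"id": "a", "preferred": True},
    {"id": "b", "preferred": None},   # provider left the flag unset (null)
    {"id": "c", "preferred": False},  # provider marked this calculation as not preferred
    {"id": "d"},                      # field absent entirely
]


def default_view(entries):
    """Keep entries whose `preferred` flag is True or unset (null or absent)."""
    return [e for e in entries if e.get("preferred") in (True, None)]


preferred_ids = [e["id"] for e in default_view(entries)]
```

With this default, databases that never set the flag behave exactly as before, and only entries explicitly marked as not preferred drop out of the default result.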

One design coming up in my discussions about this is a separate properties endpoint. The recently introduced OPTiMaDe Relationships feature is used to tie structures, calculations, and properties together. If you want all properties pertaining to the structure with ID=42, you'd query the properties endpoint with "structures.id HAS 42". In the relationships part of the JSON API response you can then see, e.g., under calculations, that it was produced via calculations with IDs 64, 23, and 123.
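A rough sketch of what such a request could look like from the client side, in Python. The base URL is a placeholder, the `/properties` path is the endpoint proposed in this comment (it is not an existing one), and the shape of the `relationships` object is an assumption made for illustration.

```python
from urllib.parse import quote, urlencode

# Placeholder base URL; substitute a real provider's address.
BASE_URL = "https://example.org/optimade"


def properties_query_url(structure_id: int) -> str:
    """Build a URL asking the hypothetical properties endpoint about one structure."""
    filter_expr = f"structures.id HAS {structure_id}"
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{BASE_URL}/properties?" + urlencode({"filter": filter_expr}, quote_via=quote)


def calculation_ids(entry: dict) -> list:
    """Read related calculation IDs from an entry's relationships (assumed shape)."""
    return [rel["id"] for rel in entry.get("relationships", {}).get("calculations", [])]


url = properties_query_url(42)

# Made-up response entry, only to exercise calculation_ids().
sample_entry = {
    "id": 7,
    "type": "property",
    "relationships": {"calculations": [{"id": 64}, {"id": 23}, {"id": 123}]},
}
ids = calculation_ids(sample_entry)
```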

About the point of @ml-evs: Before we settled on the present implementation of subdatabases, one idea was to just have another query parameter database=<name>. What is being asked for here strikes me as something quite similar. Hence, could we have a standardized query parameter datasets=<comma separated list of dataset identifiers> where all requests are restricted to the subset of entries indicated by those datasets? The API implementation is allowed to use anything it finds appropriate as the default of datasets.

I would support the introduction of a datasets parameter for the use-case of structures at multiple levels of accuracy and "properties" (as opposed to properties) that depend on multiple structures, e.g. formation energy, distance from convex hull. I assume that the query datasets=A,B would by default return A ∩ B (as opposed to HAS EXACTLY A, B or HAS ANY A, B)? As this is quite a simple feature to describe, I'm happy to make a draft PR (probably for v1.0) so that others can have their say.
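The three candidate readings of datasets=A,B can be compared with a small Python sketch. The entries, dataset labels, and the `mode` argument are all made up for illustration; only the intersection reading ("all") corresponds to the default assumed above.

```python
# Made-up mapping from entry ID to the datasets that entry belongs to.
entries = {
    "s1": {"A"},
    "s2": {"A", "B"},
    "s3": {"B"},
    "s4": {"A", "B", "C"},
}


def select(entries, requested, mode="all"):
    """Return the sorted IDs of entries matching the requested datasets.

    mode="all":     entry belongs to every requested dataset (A ∩ B)
    mode="any":     entry belongs to at least one requested dataset
    mode="exactly": entry's datasets are exactly the requested ones
    """
    requested = set(requested)
    if mode == "all":
        keep = lambda ds: requested <= ds
    elif mode == "any":
        keep = lambda ds: bool(requested & ds)
    elif mode == "exactly":
        keep = lambda ds: ds == requested
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(entry_id for entry_id, ds in entries.items() if keep(ds))


in_both = select(entries, ["A", "B"])             # intersection reading
in_either = select(entries, ["A", "B"], "any")
only_these = select(entries, ["A", "B"], "exactly")
```

The intersection reading keeps entries that carry extra dataset labels (like "s4" here), which is the difference from the "exactly" reading.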

On the first point, I think a properties endpoint would be a sensible future addition to the spec, though standardization will be tricky when trying to compare apples to apples across databases. (I think this is the point that Abi (@tachyontraveler) was making in the discussion, so perhaps he could elaborate or correct me.) For example, what would be the mechanism for a user to query "give me all structures with a band gap between 1 and 3 eV" (and get sensible answers)? Is that something we even want to support?

I guess one option would be to explicitly keep properties unstandardized (beyond the generic entry fields) to make it clear that all values are database-specific and care must be taken when comparing them. While we can all agree on what a band gap is, agreeing on what a GGA band gap, a hybrid band gap with 0.333% HF, an LDA + scissor operator band gap, a GW band gap, etc. each mean is much harder. In this case the property sub-type of band_gap could be standardized, and maybe even its units, but then all other keys must have the db-specific prefix, though this feels a little drastic to me...

I'm happy to make a draft PR (probably for v1.0) so that others can have their say.

I view it as we are more or less in feature freeze for v1.0 now, so I suggest holding on until v0.10 is out, and we have possibly made the hypothetical conversion to RST. (But this shouldn't discourage anyone, I'm not opposed to v0.11 coming out rather soon after v0.10...)

I have been thinking a bit about the properties endpoint. A proposal follows.

We may want to support two types of queries:

1. Queries from someone who knows and understands the models, who is able to be very precise in their queries. E.g., "I want all band gap calculations for Si made with DFT using the HSE functional with 0.2 - 0.5% HF exchange; and all GW band gaps that were made with self-consistent GW."
2. Queries from someone just asking for the "best value" of a true, physical, quantity: "What is the optical gap of Si?"

(The point here is the two different mindsets. The queries could also be formulated as, e.g., 'Give me all materials that ...')

The proposal is to support queries of type 1 by careful standardization in the calculations endpoint. We standardize model-specific meta-data, e.g., calculation of type "DFT" -> using functional "HSE" -> using "25%" HF exchange, etc. We also have standardized fields for the output, e.g., ks_gap = 1.2 (eV). It is understood that not all settings of a calculation are encoded. For example, someone may set the ENAUG parameter in VASP to something quite unusual, and this isn't reflected at all in any of the standardized parameters. So, in this sense, the responsibility falls on the user not to grab this data and compare apples and oranges.
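As a rough illustration of a type 1 query against standardized calculation metadata, here is a Python sketch. The field names (`method`, `functional`, `hf_exchange_fraction`) are hypothetical stand-ins for whatever the spec would standardize, HF exchange is stored as a fraction by assumption, and the records are made up.

```python
from dataclasses import dataclass


@dataclass
class Calculation:
    """Hypothetical standardized metadata plus one standardized output field."""
    id: int
    method: str                  # e.g. "DFT"
    functional: str              # e.g. "HSE"
    hf_exchange_fraction: float  # ASSUMPTION: stored as a fraction, 0.25 means 25%
    ks_gap: float                # Kohn-Sham gap in eV


# Made-up records for illustration only.
calculations = [
    Calculation(1, "DFT", "HSE", 0.25, 1.2),
    Calculation(2, "DFT", "PBE", 0.0, 0.6),
    Calculation(3, "DFT", "HSE", 0.10, 0.9),
]


def hse_with_exchange_between(calcs, low, high):
    """Select DFT/HSE calculations whose HF exchange fraction lies in [low, high]."""
    return [
        c for c in calcs
        if c.method == "DFT"
        and c.functional == "HSE"
        and low <= c.hf_exchange_fraction <= high
    ]


selected_ids = [c.id for c in hse_with_exchange_between(calculations, 0.2, 0.5)]
```

Settings that are not standardized (such as the ENAUG example above) are invisible to a filter like this, which is exactly why the responsibility for comparing like with like stays with the user.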

Queries of type 2, on the other hand, are served by the properties endpoint. All properties in this endpoint are to be physical quantities, i.e., things that can at least in principle be measured, so that there is one and only one correct answer. Here the view is that any calculation is a 'model' for the correct value of this physical quantity. For example, we may carefully define quasiparticle_band_gap and optical_band_gap as two different things, but there certainly is no ks_band_gap.

The properties endpoint connects these standardized names to a value + an accuracy profile. The accuracy profile is some way to encode the expected accuracy of the value to the true physical quantity. The idea is that the answer to query 2 above is then the topmost result of, e.g., 'structures.id = 44 AND name="optical_band_gap"&sort=accuracy'.

The tricky part here is, what is an accuracy profile for a property? It needs to encode both an error distribution (with as high a sophistication as we'd like to enable) AND it must encode the inaccuracy of that error distribution. The accuracy of the error distribution can go from "we believe the error is very small, but that is a pure guess" to "we have run millions of tests, so we are sure that the model used to calculate this value predicts the optical bandgap with a perfect normal distribution with this exact standard deviation". It may not be unusual to prefer a more well-tested, but less accurate, value over a less tested but believed to be more accurate result. Hence, I'd like a mandatory overall accuracy parameter to be a weighted-together figure of merit, so that, for the layman, sorting on accuracy gives the "best" value at the top.
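The "sort on accuracy and take the top result" idea could be sketched as below. This is only an illustration: the flat `structure_id` key and the sample values are made up, and the code assumes that a larger `accuracy` figure means a better value, which the proposal does not state.

```python
# Made-up property entries; values are illustrative only.
properties = [
    {"structure_id": 44, "name": "optical_band_gap", "value": 1.1, "accuracy": 12.0},
    {"structure_id": 44, "name": "optical_band_gap", "value": 1.2, "accuracy": 42.0},
    {"structure_id": 44, "name": "quasiparticle_band_gap", "value": 1.3, "accuracy": 90.0},
    {"structure_id": 45, "name": "optical_band_gap", "value": 2.0, "accuracy": 99.0},
]


def best_value(properties, structure_id, name):
    """Return the matching entry ranked first when sorting on accuracy, or None."""
    matches = [
        p for p in properties
        if p["structure_id"] == structure_id and p["name"] == name
    ]
    # ASSUMPTION: a larger accuracy figure of merit is better, so sort descending.
    ranked = sorted(matches, key=lambda p: p["accuracy"], reverse=True)
    return ranked[0] if ranked else None


best = best_value(properties, 44, "optical_band_gap")
```

How the single figure of merit is weighted together from the error distribution and its own uncertainty is left open here, as it is in the proposal.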

So, what I'm imagining is a JSON response schema on the properties endpoint that looks something like this:

'data': [
{
  'id': 36,
  'type': 'property',
  'attributes': {
    'name': 'optical_band_gap',
    'value': 1.20736,
    'accuracy': 42.356,                  // Mandatory
    'std_dev': 16.323,                   // Mandatory
    'kurtosis': 2.3,                     // Optional
    'skewness': 1.34,                    // Optional
    'distribution': {                    // Optional
       'name': 'weibull',
       'lambda': 1.23,
       'k': 5.43
    },
    'error_std_dev_std_dev': 3.233,      // Mandatory
    'error_std_dev_kurtosis': 2.3,       // Optional
    'error_std_dev_skewness': 1.6,       // Optional
    'error_std_dev_distribution': {      // Optional
       'name': 'lognormal',
       'sigma': 0.26,
       'mu': 7.23
    },
    'error_kurtosis_std_dev': 3.233,     // Optional
    // ...
  },
  'relationships': {
    'calculations': [{'id': 42, 'description': 'produced_by'}],
    'structures': [{'id': 44, 'description': 'for_material'}]
  }
}
]
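A client or validator could check the fields marked "Mandatory" in the sketch above along these lines. This is a minimal Python sketch: the key list is copied from the comments in the schema, `name` and `value` are left out because the sketch does not annotate them, and the helper name is made up.

```python
# Keys annotated as "Mandatory" in the schema sketch above.
MANDATORY_KEYS = ("accuracy", "std_dev", "error_std_dev_std_dev")


def missing_mandatory(entry: dict) -> list:
    """List the mandatory accuracy-profile keys absent from an entry's attributes."""
    attributes = entry.get("attributes", {})
    return [key for key in MANDATORY_KEYS if key not in attributes]


# Entries modelled on the sketch above, trimmed to the relevant fields.
complete = {
    "id": 36,
    "type": "property",
    "attributes": {
        "name": "optical_band_gap",
        "value": 1.20736,
        "accuracy": 42.356,
        "std_dev": 16.323,
        "error_std_dev_std_dev": 3.233,
    },
}
incomplete = {
    "id": 37,
    "type": "property",
    "attributes": {"name": "optical_band_gap", "value": 1.3, "accuracy": 10.0},
}

complete_missing = missing_mandatory(complete)
incomplete_missing = missing_mandatory(incomplete)
```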

The precise details of how to specify these statistical quantities are of course up for discussion. But the idea here is to show the level of richness we may support in both specifying the estimate of the accuracy itself, and the estimated error of that accuracy.

I'm closing this via #376, but please feel free to re-open if there are aspects of this issue that need further refinement (e.g., explicit support for statistical quantities/multifidelity properties).

Materials-Consortia / OPTIMADE

Standardize how to specify materials properties for structures #74