Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Allow to specify `type` in `/info/<entry-endpoint>` #269

Closed CasperWA closed 1 year ago

CasperWA commented 4 years ago

To be able to dynamically and effectively handle (especially) provider-specific resource properties, e.g., _exmpl_some_energy, it would be useful to know the expected value type.

It would make most sense to add this under the entry-endpoint's specific informational endpoint, e.g., /info/structures for the /structures entry-endpoint. It could then be a key (type) alongside description, unit, and sortable. It should be OPTIONAL.
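A minimal sketch (as a Python dict mirroring the JSON response) of what such a property entry could look like; the property name, values, and surrounding structure here are illustrative, not normative:

```python
# Hypothetical fragment of a /info/structures response, written as a
# Python dict. The "type" key is the proposed OPTIONAL addition alongside
# the existing "description", "unit", and "sortable" keys.
info_response = {
    "data": {
        "properties": {
            "_exmpl_some_energy": {
                "description": "An example provider-specific energy",
                "unit": "eV",
                "sortable": True,
                "type": "float",  # the proposed OPTIONAL key
            }
        }
    }
}

prop = info_response["data"]["properties"]["_exmpl_some_energy"]
print(prop["type"])  # -> float
```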

I don't think many would be against adding this. The only tricky part, which could spark a bit of discussion, is the set of allowed/recognized values.

I would propose to use JSON-typed values, since our spec already leans heavily on JSON API v1.0. However, I wouldn't mind custom types either, as long as they map clearly and uniquely to major programming language types. In that respect, the plain JSON types may not even be suitable, and we should instead look towards OpenAPI's extended JSON types to better distinguish between an array, an array with unique entries, an ordered array, etc.

ml-evs commented 4 years ago

I really like this suggestion as it instantly makes clients more useful for exploring a particular implementation.

I would personally prefer to stick to the JSON API types, just so we don't mix standards in the spec. Are the OpenAPI types a superset of JSON types, or do they also change the meaning of some terms?

CasperWA commented 4 years ago

I would personally prefer to stick to the JSON API types, just so we don't mix standards in the spec. Are the OpenAPI types a superset of JSON types, or do they also change the meaning of some terms?

I think they are a superset, making use of not having single-word types, but defining them through a JSON object with the base type as the value for a type key, and then a "modifier" through other keys. This actually already exists in JSON for numbers.
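For reference, this is roughly how JSON Schema (which OpenAPI builds on) expresses a "modified" base type: the base type sits under a `type` key and sibling keywords refine it. Shown here as Python dicts; `items`, `uniqueItems`, `minimum`, and `multipleOf` are standard JSON Schema keywords:

```python
# A plain array of numbers:
plain_array = {"type": "array", "items": {"type": "number"}}

# An "array with unique entries" (set-like), via the uniqueItems modifier:
unique_array = {"type": "array", "items": {"type": "number"}, "uniqueItems": True}

# Numbers already work this way in JSON Schema, e.g. a constrained integer:
bounded_int = {"type": "integer", "minimum": 0, "multipleOf": 2}
```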

In writing up the PR, I actually remembered that we have defined types in the spec, using list instead of array, and more. I will update the PR using these OPTIMADE-defined types.

rartino commented 4 years ago

I agree with the general idea.

I guess I am the one to disagree with JSON types. OPTIMADE has its own set of types for 'Entry properties' in section 2.1. I think those should be used here.

We essentially have type data defined for each entry type in section 7 under 'Type'. It is just a question of how we should "typographically" represent things like 'list of strings' and 'list of list of floats or unknown values'. Should those just be 'list', or? It would be nice to have a machine-readable format that can validate the data that comes out.

Edit: I see now that @CasperWA added a comment just as I was writing this, essentially making the same observation about types.

CasperWA commented 4 years ago

We essentially have type data defined for each entry type in section 7 under 'Type'. It is just a question of how we should "typographically" represent things like 'list of strings' and 'list of list of floats or unknown values'. Should those just be 'list', or? It would be nice to have a machine-readable format that can validate the data that comes out.

This is a great question I didn't really consider.

As I see it, there are two clear options:

- Have only a single type value for each property, it being the "outermost" one, or
- Support "nested" property types à la my example above?

However, there is most probably another way to do this, I just see those as the obvious ones.

rartino commented 4 years ago

Since you were already on the idea of reporting JSON types, I think I would be OK with a specification of type per output format. In that case the JSON output format could really use OpenAPI to specify validatable JSON types.

But if so, it would be quite nice if the OPTIMADE types were also specified somewhere, so we'd still have the original problem :)

CasperWA commented 4 years ago

Since you were already on the idea of reporting JSON types, I think I would be OK with a specification of type per output format. In that case the JSON output format could really use OpenAPI to specify validatable JSON types.

So you would add OpenAPI lingo into this endpoint under a new top-level key type_by_format (or incorporate it into output_fields_by_format) or do you mean to add this into each property's type value, so one would specify the type according to the format?

But if so, it would be quite nice if the OPTIMADE types were also specified somewhere, so we'd still have the original problem :)

Hehe, yeah - then one format would be optimade, which is just gibberish, kind of.

rartino commented 4 years ago

So you would add OpenAPI lingo into this endpoint under a new top-level key type_by_format (or incorporate it into output_fields_by_format) or do you mean to add this into each property's type value, so one would specify the type according to the format?

Thinking this through, no option seems particularly elegant. It feels odd to separate out type information into a separate top-level key, if everything else is grouped together by property. Completely restructuring property info under these endpoints into an output_fields_by_format structure is also not a great option at this point. On the other hand, sticking what could be a massive data structure with one field per conceivable output format inside the property dictionaries seems bulky.

And all this for providing info that is anyway intended to be redundant with the OPTIMADE type information. Furthermore, isn't the OpenAPI type representation also already present in the JSON schema for the output?

Hence, after thinking more about this, I lean towards only giving OPTIMADE type info in a machine-readable format, and that we provide a library algorithm that maps that onto a OpenAPI schema fragment for the JSON output. Similar algorithms can then be created for other formats, e.g., to HDF5 types, etc.
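As a sketch of what such a mapping algorithm could look like (the function and table below are illustrative, not an agreed-upon library API; the OPTIMADE type names are taken from the spec's 'Entry properties' section):

```python
# Illustrative mapping from OPTIMADE entry-property types to
# OpenAPI/JSON Schema fragments for the JSON output format.
OPTIMADE_TO_OPENAPI = {
    "string": {"type": "string"},
    "integer": {"type": "integer"},
    "float": {"type": "number"},
    "boolean": {"type": "boolean"},
    "timestamp": {"type": "string", "format": "date-time"},
}

def to_openapi(optimade_type):
    """Map an OPTIMADE type name, or a nested list type, to an OpenAPI fragment."""
    if isinstance(optimade_type, dict) and optimade_type.get("type") == "list":
        return {"type": "array", "items": to_openapi(optimade_type["content"])}
    return dict(OPTIMADE_TO_OPENAPI[optimade_type])

print(to_openapi({"type": "list", "content": "float"}))
# -> {'type': 'array', 'items': {'type': 'number'}}
```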

CasperWA commented 4 years ago

Thinking this through, no option seems particularly elegant. It feels odd to separate out type information into a separate top-level key, if everything else is grouped together by property. Completely restructuring property info under these endpoints into an output_fields_by_format structure is also not a great option at this point. On the other hand, sticking what could be a massive data structure with one field per conceivable output format inside the property dictionaries seems bulky.

Yeah, I tend to agree.

And all this for providing info that is anyway intended to be redundant with the OPTIMADE type information. Furthermore, isn't the OpenAPI type representation also already present in the JSON schema for the output?

I don't think so. The OpenAPI type representation is (as I understand it) an extension of the JSON type representation and only present in the openapi.json schemas. Nor is there any explicit type representation otherwise. However, there is of course the implicit representation (as when you decode JSON into Python types or similar).

Hence, after thinking more about this, I lean towards only giving OPTIMADE type info in a machine-readable format, and that we provide a library algorithm that maps that onto a OpenAPI schema fragment for the JSON output. Similar algorithms can then be created for other formats, e.g., to HDF5 types, etc.

I like this. We should then think about extending the spec with this info - or even better, have some proper documentation that fleshes out the spec more with examples and such. In that way we can maybe even thin out the actual spec, removing some examples and redundant text.

However, for me the question still stands:

- Have only a single type value for each property, it being the "outermost" one, or
- Support "nested" property types à la my example above?

CasperWA commented 4 years ago

On another note, the OPTIMADE data types do not take lists of unique elements into account (sets in Python - although sets in Python are different again, since they are unordered). This is not a necessity to have (I think?), it is merely an observation. And I wonder if it was an oversight on our part originally?

rartino commented 4 years ago

@CasperWA Yes, there is no 'set' type in OPTIMADE, only lists. Any set can be represented as a list of the elements, so the need for this as a separate data type hasn't come up yet. IMO the distinction is more relevant in programming languages than for 'data transport' in API requests and responses. You are free to cast to/from sets as you wish in your implementation.

rartino commented 4 years ago

Furthermore, isn't the OpenAPI type representation also already present in the JSON schema for the output? I don't think so. The OpenAPI type representation is (as I understand it) an extension of the JSON type representation and only present in the openapi.json schemas. Nor is there any explicit type representation otherwise.

Do we presently not do any form of validation that a JSON response containing property data conforms to an expected JSON data type for that property? We probably should do that.

However, for me the question still stands:

- Have only a single type value for each property, it being the "outermost" one, or
- Support "nested" property types à la my example above?

Personally I'm in favor of encoding the nested types if we can do it consistently. We already specify it in the specification, so it seems odd to throw away information.

CasperWA commented 4 years ago

Furthermore, isn't the OpenAPI type representation also already present in the JSON schema for the output? I don't think so. The OpenAPI type representation is (as I understand it) an extension of the JSON type representation and only present in the openapi.json schemas. Nor is there any explicit type representation otherwise.

Do we presently not do any form of validation that a JSON response containing property data conforms to an expected JSON data type for that property? We probably should do that.

I don't think we do, no. Maybe we do do it in the OPTIMADE Validator, which is a part of the optimade-python-tools repository? But most likely it is implicit. What do you reckon @ml-evs?

Personally I'm in favor of encoding the nested types if we can do it consistently. We already specify it in the specification, so it seems odd to throw away information.

I agree, and will update #270 accordingly.

CasperWA commented 4 years ago

@CasperWA Yes, there is no 'set' type in OPTIMADE, only lists. Any set can be represented as a list of the elements, so the need for this as a separate data type hasn't come up yet. IMO the distinction is more relevant in programming languages than for 'data transport' in API requests and responses. You are free to cast to/from sets as you wish in your implementation.

I can agree with this sentiment.

CasperWA commented 4 years ago

All right. So as a first suggestion for nested types, I have this example for three custom properties:

{
  "data": {
    // ...
    "properties": {
      "_exmpl_entry_version": {
        // ...
        "type": "integer"
      },
      "_exmpl_list_prop": {
        // ...
        "type": {
          "type": "list",
          "content": {
            "type": "list",
            "content": "float"
          }
        }
      },
      "_exmpl_dict_prop": {
        // ...
        "type": {
          "type": "list",
          "content": {
            "type": "dictionary",
            "content": {
              "name": "string",
              "occupation": {
                "type": "list",
                "content": "float"
              }
            }
          }
        }
      }
      // ... <more properties>
    }
  }
  // ...
}

The only loss of information now is the possibility of a collection type containing more than a single type (so the information about unknown values for _exmpl_list_prop is lost). I guess this could be solved by letting content also take a list value, instead of either a string or another dictionary? But I refrained from that at this point, since I think it would clutter things up too much.

I don't know if this makes sense and is good enough, or we should instead try to appropriate the OpenAPI nomenclature/data structure for describing a field's type?
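For what it's worth, a nested `type` structure like the one above could be consumed by a small recursive checker along these lines (an illustrative sketch, not part of the proposal; `None` stands in for unknown values):

```python
# Check a value against a nested type structure of the proposed form.
PRIMITIVES = {"string": str, "integer": int, "float": (int, float), "boolean": bool}

def conforms(value, type_spec):
    if value is None:  # treat unknown values as allowed everywhere
        return True
    if isinstance(type_spec, str):  # a bare primitive type name
        return isinstance(value, PRIMITIVES[type_spec])
    if type_spec["type"] == "list":
        return isinstance(value, list) and all(
            conforms(v, type_spec["content"]) for v in value
        )
    if type_spec["type"] == "dictionary":
        return isinstance(value, dict) and all(
            conforms(value.get(k), spec) for k, spec in type_spec["content"].items()
        )
    return False

# The _exmpl_list_prop type from the example: list of lists of floats.
list_prop_type = {"type": "list", "content": {"type": "list", "content": "float"}}
print(conforms([[1.0, 2.5], None], list_prop_type))  # -> True
```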

ml-evs commented 4 years ago

I don't think we do, no. Maybe we do do it in the OPTIMADE Validator, which is a part of the optimade-python-tools repository? But most likely it is implicit. What do you reckon @ml-evs?

It's implicit in the deserialization step to our pydantic models in the validator, though this may be more lenient than we want, as pydantic just uses standard Python type coercion under the hood (e.g. bool, str and numerical types will all be essentially interchangeable, but structured types should be validated properly).
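To illustrate that leniency with plain Python (not pydantic itself, which layers its own rules on top of this):

```python
# Standard Python coercions of the kind a lenient validator might apply.
# Each "succeeds" even though the input type is arguably wrong:
print(int("5"))       # -> 5    (a string accepted as an integer)
print(float(True))    # -> 1.0  (a bool accepted as a float)
print(bool("false"))  # -> True (any non-empty string is truthy)
```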

All right. So as a first suggestion for nested types, I have this example for three custom properties:

* `_exmpl_list_prop`: list of list of floats or unknown values

* `_exmpl_dict_prop`: list of dictionaries with keys:

  * `name`: string
  * `occupation`: list of floats

* `_exmpl_entry_version`: int

...

The only loss of information now is the possibility of a collection type containing more than a single type (so the information about unknown values for _exmpl_list_prop is lost). I guess this could be solved by letting content also take a list value, instead of either a string or another dictionary? But I refrained from that at this point, since I think it would clutter things up too much.

I don't know if this makes sense and is good enough, or we should instead try to appropriate the OpenAPI nomenclature/data structure for describing a field's type?

I like this approach. We don't have any mixed types in collections aside from unknown values, which are clearly described in the descriptions of fields that allow them.

At a later date, we could even add an additional flag to fields, something like allow_unknown to standardize this more rigorously across the spec?

CasperWA commented 4 years ago

I like this approach. We don't have any mixed types in collections aside from unknown values, which are clearly described in the descriptions of fields that allow them.

At a later date, we could even add an additional flag to fields, something like allow_unknown to standardize this more rigorously across the spec?

Cheers. And I like the idea of the allow_unknown flag. However, if we intend to allow collections with mixed types, it may be better to properly implement this from the start? I'm not completely sure. For me, mixed types in collections would defeat some of the automation, though not completely, since everything is still ordered... But I would still prefer not to do this, and instead simply have the allow_unknown flag 👍

rartino commented 4 years ago

I have been arguing previously that we should commit to never mix types, apart from allowing nulls (and even this we should be somewhat careful with). We are trying to create a standard that spans backend and response format representations, and for some of those, mixed types may be tricky and/or very non-optimal to represent.

fekad commented 4 years ago

I'm really sorry but somehow I missed this issue before.

If I understand right, there are four use cases where a machine-readable format of a property's type could be useful:

1) When you want to support autocompletion, e.g., in Python notebooks
2) When you want to read/load/adapt the result of a query itself
3) When you want to validate the data before you put it into the database
4) When you want to do a query using database-specific properties (like STARTS on a string, > on floats, HAS on arrays, etc.), which would be extremely useful in the future

I think a solution which supports all of these use-cases would be great but:

1) In the first case, I think only the dict type provides you with useful information. The concrete and list types are not relevant here, but I think it is not a good idea to support nested properties (see below)

2) In this case, many formats (like HDF5) already store the types alongside the data. In the case of JSON, the type of the data can be inferred from the representation itself. Of course, there can be weird cases (like the question of whether "1" is a string, an integer or a float), but they usually cause only minor issues. If you really want to utilise all the type information, you have to build your own JSON decoder, which can hit your performance hard (even if you are only using custom validators in pydantic). I'm just saying you should trust the data that you get from the query, and ideally you should be able to use it as it is.

3) I think in this case the use of the type information would definitely help to check the data before you push it into the database, but realistically there are too many options to do that. It is also quite tricky to implement a validator which gets its specification from the web through an API (which is not very realistic). Obviously, you can use the

4) I think types would be the most important in this case (and it will be), but for now you can always try to do the query and allow it to fail. Of course, it would be nice to have a well-formatted error message, but I'm not convinced it would be worth it.

In conclusion, I completely agree that types are very important, and in science a value doesn't mean anything without its context (definition, unit, type, etc.). I'm just saying that in the end this information is more useful for the human than for the machine :)

I would rather use a simple but still useful representation like:

Notes:

The type would describe the type of the data and the shape its shape :) Obviously, an array is just a list of lists of something, but that definition doesn't give me any information about its shape. Of course, in this case, there is no way to describe nested data structures like a dictionary, but I would rather allow having a nested property (like _exmpl.energy.total) than allow defining nested property types. In this case, every property in the dictionary can be described if needed; otherwise it can be handled as any or dict.
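A rough sketch of the type-plus-shape idea (the key names and the "n" placeholder for a dimension of unspecified length are hypothetical):

```python
# Hypothetical type+shape descriptions, in the spirit of numpy's dtype/shape.
cartesian_site_positions = {"type": "float", "shape": ["n", 3]}
nsites = {"type": "integer"}  # a scalar needs no "shape" key

def matches_shape(value, shape):
    """Check nested lists against a shape like ['n', 3] (illustrative only)."""
    if not shape:  # shape exhausted: expect a scalar, not another list
        return not isinstance(value, list)
    dim, rest = shape[0], shape[1:]
    if not isinstance(value, list):
        return False
    if dim != "n" and len(value) != dim:  # a fixed dimension must match exactly
        return False
    return all(matches_shape(v, rest) for v in value)

print(matches_shape([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]], ["n", 3]))  # -> True
```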

Small remarks:

I know a lot of decisions have already been made for a reason (like not allowing nested properties), so in those cases please just ignore the remarks without explaining them to me, and use the ones which are relevant for you.

ml-evs commented 4 years ago

@fekad I should probably point you towards @CasperWA's PR #270, which has some concrete suggestions, especially for dealing with nested properties (which are allowed; some OPTIMADE properties themselves are nested, like species). The discussion on that PR has also been extended to providing descriptions for provider-specific fields.

fekad commented 4 years ago

@ml-evs Thanks - of course, I checked the PR before, and I still have the same confusion, namely:

Of course, that list-of-dictionaries structure can be useful in many other cases; I just can't see when one would need to have the information about the types. (Querying is very difficult, and as far as I know, in other cases like autocompletion it won't work either, because of the need to use indices.)

For example, the autocomplete won't work in the following cases:

CasperWA commented 4 years ago

For me, the original motivation and use case is your 4th stated use case above. Essentially, I have and am developing a client, which you can find here: https://dev-tools.materialscloud.org/optimadeclient/ In it, I would like, e.g., a filter tab called "Advanced" or "Provider-specific". Here, I would then dynamically add input filters depending on the provider-specific fields. There would be some behind-the-scenes logic to figure out whether that is possible (if a field is too nested or has a certain combination of nestings), but most probably it will just look for specific types that I have specific input handlers (widgets) for, e.g., str or int, and otherwise not include an input for the field.
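A sketch of that client-side dispatch logic (the widget names and the registry are hypothetical placeholders, not part of any real client):

```python
# Map a field's reported "type" to an input widget; fields without a
# registered handler simply get no filter input in the UI.
WIDGETS = {"string": "TextInput", "integer": "IntSlider", "float": "FloatSlider"}

def widget_for(field_type):
    """Return a widget name for a simple type, or None if unsupported."""
    if isinstance(field_type, str):
        return WIDGETS.get(field_type)
    return None  # nested/complex types are skipped

print(widget_for("integer"))                             # -> IntSlider
print(widget_for({"type": "list", "content": "float"}))  # -> None
```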

Your comments above seem to go a bit above and beyond, I think, but touch on some crucial concepts. Since my use case was only what I've outlined here, I didn't really consider this discussion in the context of other use cases. If we go forward with this implementation of a type key, we should make the use case clear, i.e., the intended use for it, and spell it out in the spec, perhaps? Personally, I don't think the goal should be to use this for response validation. I think we have better tools for that in the optimade-python-tools repository. Or at least that's the hope and intention. And your second use case (handling the result of a query, i.e., a response) is, I think, (to an extent) highly related to the handler language of choice. It's a matter of taste in that respect, and not a responsibility of OPTIMADE.

fekad commented 4 years ago

I also understand that nested properties are allowed to be used, but as far as I know there is no way to describe them in the info/structures endpoint. I just don't understand why it is good to describe the first layers as properties and the rest as type definitions?

The same applies to even simpler structures like _odbx_dft_parameters. In this case, you would have a single key element which contains a dictionary, which contains a type definition, which describes the nested structure...

I'm just saying I would use the nested structure from the beginning...

CasperWA commented 4 years ago

I also understand that nested properties are allowed to be used, but as far as I know there is no way to describe them in the info/structures endpoint. I just don't understand why it is good to describe the first layers as properties and the rest as type definitions?

In the PR there is a suggestion on how this can be mitigated, so that the nested properties can be described as well.

fekad commented 4 years ago

I also understand that nested properties are allowed to be used, but as far as I know there is no way to describe them in the info/structures endpoint. I just don't understand why it is good to describe the first layers as properties and the rest as type definitions?

In the PR there is a suggestion on how this can be mitigated, so that the nested properties can be described as well.

I was mainly looking at section 5.3.2 Entry Listing Info Endpoints and I missed the suggestions, but I will continue there :)

CasperWA commented 4 years ago

Cheers @fekad. And I appreciate you continuing the discussion in this issue as well (instead of the PR). I'd like to see if we can keep the PR clear and only about suggested changes to the content (although I already kind of blew that one...). So that the issue becomes "design" and the PR becomes "implementation". Because otherwise I fear we'll lose complete track of the discussion details.

CasperWA commented 4 years ago

@fekad So with the example in the PR I finally understand your intended use of type in combination with shape. While it's a little tricky to wrap my head around when type describes the "first order" or "second order" type, I see the value in adding linked restrictions in the shape. But as I understand it, the addition of "links" in this way would be the main thing gained by using shape? Also, I don't think I see the major difference in using a list instead of a dictionary with the "name" as key? A property name has already been defined to be [a-z_][a-z_0-9]*, so it must be a string, so if JSON has issues with a key not being a string (I forget if it does), then that shouldn't be an issue :)

fekad commented 4 years ago

@fekad So with the example in the PR I finally understand your intended use of type in combination with shape. While it's a little tricky to wrap my head around when type describes the "first order" or "second order" type, I see the value in adding linked restrictions in the shape. But as I understand it, the addition of "links" in this way would be the main thing gained by using shape?

Mainly yes, and it is also simpler to represent something like list of list of list of floats structures in the case of a 3D array.

Also, I don't think I see the major difference in using a list instead of a dictionary with the "name" as key? A property name has already been defined to be [a-z_][a-z_0-9]*, so it must be a string, so if JSON has issues with a key not being a string (I forget if it does), then that shouldn't be an issue :)

I defined it as a list of dictionaries mainly for philosophical reasons. I think the name, the definition, the units, etc. simply belong together...

Are you saying that the following dictionary cannot be hosted by the OPTIMADE API?

_odbx_dft_parameters.my_nice_dict: {
    "0": "H",
    "1": "H"
}

Of course, in the Python representation, the keys (0, 1, ...) would be integers and not strings. I'm not saying that this example is useful at all, but... I also admit that in my proposed solution this would only be defined as a general dict.

I'm definitely biased by the Nomad project where the same concept was used: https://www.nomad-coe.eu/index.php?page=archive-meta-info-guide

CasperWA commented 4 years ago

Are you saying that the following dictionary cannot be hosted by the OPTIMADE API?

_odbx_dft_parameters.my_nice_dict: {
    "0": "H",
    "1": "H"
}

The key would not be allowed due to the . (period). But there are no restrictions on the value (except mixed values in "containers" are not allowed/discouraged).

fekad commented 4 years ago

Are you saying that the following dictionary cannot be hosted by the OPTIMADE API?

_odbx_dft_parameters.my_nice_dict: {
    "0": "H",
    "1": "H"
}

The key would not be allowed due to the . (period). But there are no restrictions on the value (except mixed values in "containers" are not allowed/discouraged).

Sorry, the example was not precise enough (it is not about the . (period) but rather about the restriction on the property name, [a-z_][a-z_0-9]*):

"_odbx_dft_parameters":{
    "my_nice_dict": {
        "0": "H",
        "1": "H"
    }
}

But thanks for the reply anyway, this question does not belong to this issue anyway so might be better to discuss it somewhere else...

Back to the original concept, I don't think about the type and shape as "first order" or "second order" type definitions. I would also say that an array is NOT a list of lists of something, but rather an array with a specific shape and type. Usually, an array is stored contiguously rather than as a linked list.

I know in the case of JSON an array is just a list of list of something, but you can find this type+shape concept in many programming languages:

IMHO, in the field of scientific computing, it also makes more sense...

CasperWA commented 4 years ago

Are you saying that the following dictionary cannot be hosted by the OPTIMADE API?

_odbx_dft_parameters.my_nice_dict: {
    "0": "H",
    "1": "H"
}

The key would not be allowed due to the . (period). But there are no restrictions on the value (except mixed values in "containers" are not allowed/discouraged).

Sorry, the example was not precise enough (it is not about the . (period) but rather about the restriction on the property name, [a-z_][a-z_0-9]*):

"_odbx_dft_parameters":{
    "my_nice_dict": {
        "0": "H",
        "1": "H"
    }
}

I presume this restriction only exists for "top-level" properties, i.e., not for keys within dictionary values. For those, the only limitations are what the response format allows. But I don't think this has ever been truly addressed.

But thanks for the reply anyway, this question does not belong to this issue anyway so might be better to discuss it somewhere else...

Yes, please.

Back to the original concept, I don't think about the type and shape as "first order" or "second order" type definitions. I would also say that an array is NOT a list of lists of something, but rather an array with a specific shape and type. Usually, an array is stored contiguously rather than as a linked list.

I know in the case of JSON an array is just a list of list of something, but you can find this type+shape concept in many programming languages:

  • C++: int my_array [3][5];
  • numpy in python: my_array=np.zeros((2,3)); my_array.dtype, my_array.shape
  • Julia: my_array = zeros(2,3); typeof(my_array), size(my_array)
  • HDF5, etc.

IMHO, in the field of scientific computing, it also makes more sense...

Sure! I wasn't trying to talk against it, as much as trying to wrap my head around its limitations and connotations.

I think my initial question about it was only what would be gained (or lost) by also adding shape. And in this, I think we've nailed the pros, in terms of the inter-property links/constraints, as well as the simpler way of describing heavily nested properties. What are the cons, then? Expecting the providers to deliver too much information? That we're turning the /info/<entry-endpoint> into a schema validation endpoint? I don't know whether these are cons, or how difficult it would be for everyone to provide this information. But what I want to get at is: how much can we demand to be added here before it becomes too much of a hassle for new providers, while still meeting the goals of the most common use cases?

ml-evs commented 1 year ago

I believe the combination of #376 and #282 allow this to be closed, after lots of great discussion!