ebi-ait / checklist

Template repository for checklists
Apache License 2.0
1 stars 0 forks source link

Schema generation does not account for synonyms #64

Closed ESapenaVentura closed 1 month ago

ESapenaVentura commented 2 months ago

Description of the issue

Yesterday, while check the schema for the GCS MIxS Water (Dev schema), I noticed that, while the synonyms for the fields are recorded in the "oneOf" constraints, the synonyms themselves are not defined under the properties.

Example of this For the collection date, we define the following under allOf:

[
{'required': ['collection_date']}, 
{'required': ['collection date']}, 
{'required': ['Event Date/Time']}
]

I loaded the whole schema in python, and tried to print the properties. Only the property called collection date is defined.

A practical example can be found below:

https://www.jsonschemavalidator.net/s/nQbWdab3 This is the schema, against a "valid" document. See the last property; it is not a valid value, but since collection_date is not fully defined, just required, it passes

If we changed the name of the property to collection date, it validates as expected: https://www.jsonschemavalidator.net/s/YmylrSRO

Proposed solution

Rather than having the required properties being defined under properties and its requirement being defined under allOf, define both under allOf. An example:

Currently, if a property is required is defined like this:

[{"required": ["collection_date"]}, {"required": ["collection date"]}, {"required": ["Event Date/Time"]}]

I propose:

[{'required': ['collection_date'], 'properties':  "collection_date": {
          "type": "array",
          "items": {
            "properties": {
              "text": {
                "type": "string",
                "pattern": "(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)|(^not collected$)|(^not provided$)|(^restricted access$)|(^missing: control sample$)|(^missing: sample group$)|(^missing: synthetic construct$)|(^missing: lab stock$)|(^missing: third party data$)|(^missing: data agreement established pre-2023$)|(^missing: endangered species$)|(^missing: human-identifiable$)"
              }
            },
            "required": [
              "text"
            ]
          }
        }, 
{"required": ["collection date"], "properties": "collection date": {
          "type": "array",
          "items": {
            "properties": {
              "text": {
                "type": "string",
                "pattern": "(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)|(^not collected$)|(^not provided$)|(^restricted access$)|(^missing: control sample$)|(^missing: sample group$)|(^missing: synthetic construct$)|(^missing: lab stock$)|(^missing: third party data$)|(^missing: data agreement established pre-2023$)|(^missing: endangered species$)|(^missing: human-identifiable$)"
              }
            },
            "required": [
              "text"
            ]
          }
        }, 
{'required': ['Event Date/Time'], 'properties': "Event Date/Time": {
          "type": "array",
          "items": {
            "properties": {
              "text": {
                "type": "string",
                "pattern": "(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)|(^not collected$)|(^not provided$)|(^restricted access$)|(^missing: control sample$)|(^missing: sample group$)|(^missing: synthetic construct$)|(^missing: lab stock$)|(^missing: third party data$)|(^missing: data agreement established pre-2023$)|(^missing: endangered species$)|(^missing: human-identifiable$)"
              }
            },
            "required": [
              "text"
            ]
          }
        }
]

This is a bit more cumbersome, but defining all 3 properties under the "properties" field directly would lead to the possibility of including the 3 properties in a single document, rather than being restricted to pick 1.

tburdett commented 2 months ago

I have some concerns about this approach. Firstly, one purely about the implementation - defining three similarly-named properties doesn't create synonyms in JSON: these are three semantically distinct properties, and we don't really want to encourage the use of all three. It would be better to define a single property and add annotations to it's definition in the schema, either through additional keywords or even including it in the description etc. So, rather than loosen the schema to support three variants of collection date and create ambiguity about which property should be used (or worse, having a single document that includes multiple different values for the "same" field), it would be better to only consider a single form to be valid... this then also makes it easier for tools to map and fix alternatives on the fly.

My second concern is about consistnecy with a) previous behaviour and b) the wider community. So: what is the current behvaiour in ENA? We need to be consistent with that so we can switchover, then we can start improving the behaviour in the new tools. IN terms of consistency with the wider community: I know that the Genomic Standards Consortium have siwtch to using Link-ML to define their standards, and there are tools that generate JSON Schema from LinkML. So, given that the GSC have a single source of truth for both the MIxS Water extension and the collection_date term itself (https://genomicsstandardsconsortium.github.io/mixs/0000011/), if we are going to attempt to improve the way we represent these fields, we should be looking into solutions that leverage the GSC work directly rather than reimplement.

Overall, I'd argue for sidestepping this particular issue for now and ensuring our behaviour is consistent with the current implementation in ENA. We can then sensibly prioritise these types of issues and make some considered, well thought out improvements that help us take advantage of prior art, align with the wider community, and avoid a short term fix we might need to undo later

theisuru commented 1 month ago

Untill we optimise schema further. A solution for this problem is popuating main properties section with synonyms. i.e. Copying original attribute properties into synonym fields and expand the set of available attributes.

This has been done and read #77 for more info.