Closed ESapenaVentura closed 1 month ago
I have some concerns about this approach. Firstly, one purely about the implementation - defining three similarly-named properties doesn't create synonyms in JSON: these are three semantically distinct properties, and we don't really want to encourage the use of all three. It would be better to define a single property and add annotations to it's definition in the schema, either through additional keywords or even including it in the description etc. So, rather than loosen the schema to support three variants of collection date
and create ambiguity about which property should be used (or worse, having a single document that includes multiple different values for the "same" field), it would be better to only consider a single form to be valid... this then also makes it easier for tools to map and fix alternatives on the fly.
My second concern is about consistnecy with a) previous behaviour and b) the wider community. So: what is the current behvaiour in ENA? We need to be consistent with that so we can switchover, then we can start improving the behaviour in the new tools. IN terms of consistency with the wider community: I know that the Genomic Standards Consortium have siwtch to using Link-ML to define their standards, and there are tools that generate JSON Schema from LinkML. So, given that the GSC have a single source of truth for both the MIxS Water extension and the collection_date
term itself (https://genomicsstandardsconsortium.github.io/mixs/0000011/), if we are going to attempt to improve the way we represent these fields, we should be looking into solutions that leverage the GSC work directly rather than reimplement.
Overall, I'd argue for sidestepping this particular issue for now and ensuring our behaviour is consistent with the current implementation in ENA. We can then sensibly prioritise these types of issues and make some considered, well thought out improvements that help us take advantage of prior art, align with the wider community, and avoid a short term fix we might need to undo later
Untill we optimise schema further. A solution for this problem is popuating main properties
section with synonyms. i.e. Copying original attribute properties into synonym fields and expand the set of available attributes.
This has been done and read #77 for more info.
Description of the issue
Yesterday, while check the schema for the GCS MIxS Water (Dev schema), I noticed that, while the synonyms for the fields are recorded in the "oneOf" constraints, the synonyms themselves are not defined under the properties.
Example of this For the collection date, we define the following under
allOf
:I loaded the whole schema in python, and tried to print the properties. Only the property called
collection date
is defined.A practical example can be found below:
https://www.jsonschemavalidator.net/s/nQbWdab3 This is the schema, against a "valid" document. See the last property; it is not a valid value, but since
collection_date
is not fully defined, just required, it passesIf we changed the name of the property to
collection date
, it validates as expected: https://www.jsonschemavalidator.net/s/YmylrSROProposed solution
Rather than having the required properties being defined under properties and its requirement being defined under
allOf
, define both underallOf
. An example:Currently, if a property is required is defined like this:
I propose:
This is a bit more cumbersome, but defining all 3 properties under the "properties" field directly would lead to the possibility of including the 3 properties in a single document, rather than being restricted to pick 1.