tcezard opened 1 year ago
Conclusion from today's discussion:
At this point, the schema is probably the only thing.
In the future, if we specify some ways to do some of the optional things, for example: does the `compare` endpoint allow you to POST 2 remote sequence collections, or does it require 1 to be a digest? The `service-info` endpoint could specify that these are implemented, but only after the specification defines how they would be implemented if they are.
I think this is actually the same as issue #3.
Or, maybe put another way: what is the relationship between the `/service-info` and the `/metadata` endpoints that we discussed earlier?
Sequence collection's service info needs to comply with https://github.com/ga4gh-discovery/ga4gh-service-info/blob/master/service-info.yaml and therefore needs some GA4GH content as well as some content specific to sequence collections. The GA4GH content looks like:
```json
{
  "id": "uk.ebi.eva.eva-seqcol",
  "name": "Implementation of the GA4GH Sequence collection retrieval API",
  "type": {
    "group": "org.ga4gh",
    "artifact": "seqcol",
    "version": "0.0.0"
  },
  "organization": {
    "name": "European Variation Archive",
    "url": "https://www.ebi.ac.uk/eva"
  },
  "contactUrl": "mailto:eva-helpdesk@ebi.ac.uk",
  "documentationUrl": "https://www.ebi.ac.uk/eva/....",
  "updatedAt": "date",
  "environment": "dev",
  "version": "tagged version"
}
```
If there is nothing else but the schema to specify, the sequence-collection-specific content could look like:
"seqcol": {
"description": "A collection of sequences, representing biological sequences including nucleotide or amino acid sequences. For example, a sequence collection may represent the set of chromosomes in a reference genome, the set of transcripts in a transcriptome, a collection of reference microbial sequences of a specific gene, or any other set of sequences.",
"type": "object",
"properties":{
"lengths":{
...
}
But we might want to allow for other properties to define implementation-specific features in the future:
"seqcol": {
"schema": {
"description": "A collection of sequences, representing biological sequences including nucleotide or amino acid sequences. For example, a sequence collection may represent the set of chromosomes in a reference genome, the set of transcripts in a transcriptome, a collection of reference microbial sequences of a specific gene, or any other set of sequences.",
"type": "object",
"properties":{
"lengths":{
}
}
},
"other properties": { }
}
Also, do we expect implementations to declare all the properties they are using in the schema, or to refer to our published schema?
How do those two things go together? Is it like this, parallel?

```json
{
  "id": "uk.ebi.eva.eva-seqcol",
  ...
  "seqcol": { ... }
}
```
I think the `seqcol` section should put the `schema` under a key, as in your second example. We can add other properties later if needed. One we might add could be `sorted_name_length_pairs`:
"seqcol": {
"schema": {
"description": "A collection of sequences, representing biological sequences including nucleotide or amino acid sequences. For example, a sequence collection may represent the set of chromosomes in a reference genome, the set of transcripts in a transcriptome, a collection of reference microbial sequences of a specific gene, or any other set of sequences.",
"type": "object",
"properties":{
"lengths":{
}
}
}
"sorted_name_length_pairs": true
}
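A client could then discover such an optional capability with plain JSON parsing. A minimal sketch (the key names follow the hypothetical example above):

```python
import json

# A client checking a service-info response for the optional
# sorted_name_length_pairs capability (key names as in the example above).
service_info = json.loads("""
{
  "id": "uk.ebi.eva.eva-seqcol",
  "seqcol": {
    "schema": {"type": "object"},
    "sorted_name_length_pairs": true
  }
}
""")

# Absent key means the capability is not advertised.
supports_snlp = service_info.get("seqcol", {}).get("sorted_name_length_pairs", False)
print(supports_snlp)  # True
```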
> How do those two things go together? Is it like this, parallel?
>
> `{ "id": "uk.ebi.eva.eva-seqcol", ... "seqcol": { ... } }`

Yes, that's right. See how it is done for refget in the [cram registry](https://www.ebi.ac.uk/ena/cram/sequence/service-info) or the [reference implementation](https://refgetv2.fly.dev/sequence/service-info).
After the last call, I realize @nsheff was right in that I misremembered a bit how `$ref` in JSON Schema works. It IS a way to import other JSON Schema documents, but they have to be full JSON Schemas. The reason is that, when evaluated, it is not (as I believed) that you put bits and pieces of JSON together to form a full JSON Schema and then validate the data. Rather, you start by validating the data against the parent schema, and when you meet a `$ref` keyword, you validate that part of the data against the full JSON Schema that is referenced.
See e.g.: https://json-schema.org/understanding-json-schema/structuring.html#ref
However, as was also mentioned, you can use the `$ref` keyword together with the `allOf` keyword to validate the data against several JSON Schemas (all of which need to validate).
So you could do, for instance (in YAML):
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
allOf:
  - $ref: https://raw.github.com/ga4gh/seqcol-spec/.../core_input_schema.json
  - $ref: https://raw.github.com/ga4gh/seqcol-spec/.../topology_schema.json
  - type: object
    properties:
      custom_array:
        type: array
        collated: true
        description: Custom array
        items:
          type: string
required:
  - sequences
  - topologies
inherent:
  - custom_array
```
Where the `core_input_schema.json` is the one now described in the docs, while the `topology_schema.json` could be something like this:
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  topologies:
    type: array
    collated: true
    description: Topology of the sequence
    items:
      type: string
      enum:
        - linear
        - circular
inherent:
  - topologies
```
This adds the pre-defined and standard `topologies` array to the `core_input_schema.json`, in addition to a custom array.
While very useful, this pattern leads to a problem regarding how we specify the `collated` and `inherent` properties. While a JSON Schema validator will logically handle the `required` property automatically as an AND of the `required` lists across the schemas, the `inherent` and `collated` properties must be explicitly parsed and merged by us using some custom logic (probably AND-ing the `inherent` lists, but what if e.g. a property is defined twice with differing values for `collated`?)
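To illustrate that merging problem, here is a minimal sketch (function and attribute names are made up) that combines the custom keywords across already-resolved subschemas and flags conflicting `collated` values:

```python
# Sketch of merging the custom "inherent" and "collated" keywords across
# allOf subschemas, which standard JSON Schema validators will not do.

def merge_custom_keywords(schemas):
    """Union the 'inherent' lists and collect per-property 'collated'
    flags across subschemas, raising on conflicting 'collated' values."""
    inherent = []
    collated = {}
    for schema in schemas:
        for attr in schema.get("inherent", []):
            if attr not in inherent:
                inherent.append(attr)
        for name, prop in schema.get("properties", {}).items():
            if "collated" in prop:
                if name in collated and collated[name] != prop["collated"]:
                    raise ValueError(f"conflicting 'collated' values for {name!r}")
                collated[name] = prop["collated"]
    return {"inherent": inherent, "collated": collated}

# Hypothetical stand-ins for the core and topology subschemas:
core = {
    "properties": {"lengths": {"type": "array", "collated": True}},
    "inherent": ["lengths"],
}
topology = {
    "properties": {"topologies": {"type": "array", "collated": True}},
    "inherent": ["topologies"],
}
merged = merge_custom_keywords([core, topology])
print(merged)
# {'inherent': ['lengths', 'topologies'], 'collated': {'lengths': True, 'topologies': True}}
```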
So this is finally an attempt to write down some thoughts I have regarding a restructuring of the schema. First, some of the problems I see with the current suggestion:
- … the `collection` endpoint) and the input data provided to the `compare` endpoint (`PUT` version).
- … the custom `collated` and `inherent` properties. A consequence of this is that standard JSON Schema libraries/validators might not be able to work. I am not sure whether they will fail when parsing the new properties or just ignore them, and whether that differs between implementations. In any case, standard validators will not be able to take the `collated` and `inherent` properties into consideration when validating that the data corresponds to the schema. One example of such a validation could be to check that if `collated == true` then the lengths of the arrays should be the same. Such seqcol-specific validation will require the implementation and maintenance of custom seqcol-JSON-Schema-compliant validators, and I'm not sure whether we want to provide those.
- … `collated` and `inherent` in the seqcol JSON Schema, as they are not used to validate the data. If we do provide custom validators, we break compliance with regular JSON Schema and create headaches for adopters.
- We would probably want to make e.g. a reference implementation of a seqcol server customizable, allowing the user to specify a schema. Then the server would need to parse the JSON Schema to extract information about which attributes are `collated` and/or `inherent`. This is not what a JSON Schema is made for, and can be problematic to do, as explained in the previous comment:
> While very useful, this pattern leads to a problem regarding how we specify the `collated` and `inherent` properties. So while a JSON Schema validator will logically handle the `required` property automatically as an AND of the `required` lists across the schemas, the `inherent` and `collated` properties must be explicitly parsed and merged by us using some logic (probably AND-ing the inherent lists, but what if e.g. a property is defined twice with differing values for `collated`?)
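The collated-length check mentioned above is an example of validation that would then live outside JSON Schema, as plain code over the parsed seqcol object. A minimal sketch (function and attribute names are illustrative):

```python
# Seqcol-specific validation done outside JSON Schema: all attributes
# flagged as collated must be arrays of the same length.

def check_collated_lengths(seqcol, collated_attrs):
    """Raise ValueError if the collated arrays present differ in length."""
    lengths = {a: len(seqcol[a]) for a in collated_attrs if a in seqcol}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"collated arrays differ in length: {lengths}")

seqcol = {
    "names": ["chr1", "chr2"],
    "lengths": [248956422, 242193529],
}
check_collated_lengths(seqcol, ["names", "lengths"])  # passes silently
```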
So my suggestion is quite simple:

- Do not add `collated` and `inherent` to the JSON Schema at all, to comply with basic JSON Schema for validating data.
- … under the `seqcol` property, but not under `schema`.
- … the `seqcol` part of this, depending on the existence of a general GA4GH schema for the service info. Possibly, the `seqcol` variant of the service info JSON Schema could be an extension of a general GA4GH schema.
- This makes extracting `collated` and `inherent` just regular JSON parsing.
- Any validation taking `collated` and `inherent` into account should be done outside of JSON Schema.

Here's an example ref resolver we wrote in my group:
Just a very brief recap of the essence of the discussion today. I think there are at least three use cases of service info (the `seqcol` part) that need to be taken into consideration:
Basically, I argued for splitting 1. from 2. and 3. by using a plain JSON Schema for validation, and a custom JSON structure for 2. and 3. (with an extra JSON Schema for use with validating that a server complies with our custom JSON structure). The counterargument here is that our custom JSON structure would then end up looking very much like the JSON Schema anyway, especially when allowing for future hierarchical extensions, such as for the pangenome. In practice, there will be a duplication of mostly the same information in two different structures. This is a good counterargument that I don't really have a good response to.
Say then, that we do want to use the same JSON Schema structure for all of 1., 2., and 3. Then there is the issue of whether we should allow `$ref` in the JSON Schema or not.
A related issue that was discussed was how to manage the relationship between the JSON Schema that is provided with the specification – defining the standard attributes, which of these are required, and so on – and a JSON Schema that describes the attributes actually supported by a particular server. For validation (use case 1), one way to manage this is through the use of a `$ref` to the JSON Schema provided by the seqcol standard (which all data in all seqcol-compliant servers must validate against). This is a particularly weighty argument for allowing the use of `$ref`. Another argument is the simplicity of defining the JSON Schema as a true extension (or superset) of standard JSON Schema.
However, if `$ref` is allowed, one would, for use cases 2 and 3, need to make use of a JSON Schema parser that extracts info through `$ref` references (a `$ref` resolver). The fact that such use is really outside the scope of JSON Schema might make this a cumbersome requirement for adopters to manage.
I think that is the gist of it!...
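For what it's worth, a `$ref` resolver along those lines can be sketched in a few lines. This is a toy under stated assumptions: it resolves refs only against a local registry keyed by URI (the URIs below are made up), and a real resolver must also handle JSON Pointers, anchors, remote fetching, and cyclic refs:

```python
# Toy "$ref" resolver: inline every {"$ref": uri} by looking the URI up
# in a local registry of schemas, recursing into nested objects/arrays.
# Cyclic references would recurse forever; real resolvers must guard
# against that.

def resolve_refs(schema, registry):
    """Return a copy of `schema` with each $ref replaced by its
    (recursively resolved) target from `registry`."""
    if isinstance(schema, dict):
        if "$ref" in schema:
            target = registry[schema["$ref"]]  # KeyError if the URI is unknown
            return resolve_refs(target, registry)
        return {k: resolve_refs(v, registry) for k, v in schema.items()}
    if isinstance(schema, list):
        return [resolve_refs(item, registry) for item in schema]
    return schema

# Made-up URI standing in for the seqcol core schema:
registry = {
    "https://example.org/core_input_schema.json": {
        "type": "object",
        "properties": {"lengths": {"type": "array"}},
    },
}
schema = {"allOf": [{"$ref": "https://example.org/core_input_schema.json"}]}
resolved = resolve_refs(schema, registry)
print(resolved["allOf"][0]["properties"]["lengths"])  # {'type': 'array'}
```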
Regarding extending JSON Schema:
https://json-schema.org/draft/2020-12/json-schema-core#section-6.5

> 6.5. Extending JSON Schema
>
> Additional schema keywords and schema vocabularies MAY be defined by any entity. Save for explicit agreement, schema authors SHALL NOT expect these additional keywords and vocabularies to be supported by implementations that do not explicitly document such support. Implementations SHOULD treat keywords they do not support as annotations, where the value of the keyword is the value of the annotation.
>
> Implementations MAY provide the ability to register or load handlers for vocabularies that they do not support directly. The exact mechanism for registering and implementing such handlers is implementation-dependent.
I interpret this to mean that what we're doing to add `inherent` and `collated` is explicitly approved by the JSON Schema spec.
The service-info for sequence collections will need to inherit from this specification, but additional fields can be added. This issue is to discuss which fields should be declared in the seqcol's service info.
For example, we could add the seqcol schema in the service-info.