tcezard opened 1 year ago
Conclusion from today's discussion:
At this point, the schema is probably the only thing.
In the future, if we specify some ways to do some of the optional things, for example: does the `compare` endpoint allow you to POST 2 remote sequence collections, or does it require 1 to be a digest? The `service-info` endpoint could specify that these are implemented, but only after the specification defines how they would be implemented if they are.
I think this is actually the same as issue #3.
Or, maybe put another way: what is the relationship between the `/service-info` and the `/metadata` endpoints that we discussed earlier?
Sequence collection's service info needs to comply with https://github.com/ga4gh-discovery/ga4gh-service-info/blob/master/service-info.yaml and therefore needs some GA4GH content as well as some content specific to sequence collections. The GA4GH content looks like:
```json
{
  "id": "uk.ebi.eva.eva-seqcol",
  "name": "Implementation of the GA4GH Sequence collection retrieval API",
  "type": {
    "group": "org.ga4gh",
    "artifact": "seqcol",
    "version": "0.0.0"
  },
  "organization": {
    "name": "European Variation Archive",
    "url": "https://www.ebi.ac.uk/eva"
  },
  "contactUrl": "mailto:eva-helpdesk@ebi.ac.uk",
  "documentationUrl": "https://www.ebi.ac.uk/eva/....",
  "updatedAt": "date",
  "environment": "dev",
  "version": "tagged version"
}
```
If there is nothing else but the schema to specify, the sequence-collection-specific content could look like:
"seqcol": {
"description": "A collection of sequences, representing biological sequences including nucleotide or amino acid sequences. For example, a sequence collection may represent the set of chromosomes in a reference genome, the set of transcripts in a transcriptome, a collection of reference microbial sequences of a specific gene, or any other set of sequences.",
"type": "object",
"properties":{
"lengths":{
...
}
But we might want to allow for other properties to define implementation-specific features in the future:
"seqcol": {
"schema": {
"description": "A collection of sequences, representing biological sequences including nucleotide or amino acid sequences. For example, a sequence collection may represent the set of chromosomes in a reference genome, the set of transcripts in a transcriptome, a collection of reference microbial sequences of a specific gene, or any other set of sequences.",
"type": "object",
"properties":{
"lengths":{
}
}
},
"other properties": { }
}
Also, do we expect implementations to declare all the properties they are using in the schema, or to refer to our published schema?
How do those two things go together? Is it like this, parallel?

```json
{
  "id": "uk.ebi.eva.eva-seqcol",
  ...
  "seqcol": { ... }
}
```
I think the `seqcol` section should put the `schema` under a key, as in your second example. We can add other properties later if needed. One we might add could be `sorted_name_length_pairs`:
"seqcol": {
"schema": {
"description": "A collection of sequences, representing biological sequences including nucleotide or amino acid sequences. For example, a sequence collection may represent the set of chromosomes in a reference genome, the set of transcripts in a transcriptome, a collection of reference microbial sequences of a specific gene, or any other set of sequences.",
"type": "object",
"properties":{
"lengths":{
}
}
}
"sorted_name_length_pairs": true
}
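A client could then discover such an optional capability with plain JSON parsing. A minimal sketch (the key names follow the hypothetical example above):

```python
import json

# A client checking a service-info response for the optional
# sorted_name_length_pairs capability (key names as in the example above).
service_info = json.loads("""
{
  "id": "uk.ebi.eva.eva-seqcol",
  "seqcol": {
    "schema": {"type": "object"},
    "sorted_name_length_pairs": true
  }
}
""")

# Absent key means the capability is not advertised.
supports_snlp = service_info.get("seqcol", {}).get("sorted_name_length_pairs", False)
print(supports_snlp)  # True
```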
> How do those two things go together? Is it like this, parallel?
>
> `{ "id": "uk.ebi.eva.eva-seqcol", ... "seqcol": { ... } }`

Yes, that's right. See how it is done for refget in the [cram registry](https://www.ebi.ac.uk/ena/cram/sequence/service-info) or the [reference implementation](https://refgetv2.fly.dev/sequence/service-info).
After the last call, I realize @nsheff was right in that I misremembered a bit how `$ref` in JSON Schema works. It IS a way to import other JSON Schema documents, but they have to be full JSON Schemas. The reason is that, when evaluated, it is not (as I believed) that you put bits and pieces of JSON together to form a full JSON Schema and then validate the data. Rather, you start by validating the data against the parent schema, and when you meet a `$ref` keyword, you validate that part of the data against the full JSON Schema that is referenced.
See e.g.: https://json-schema.org/understanding-json-schema/structuring.html#ref
However, as was also mentioned, you can use the `$ref` keyword together with the `allOf` keyword to validate the data against several JSON Schemas (all of which need to validate).
So you could do, for instance (in YAML):
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
allOf:
  - $ref: https://raw.github.com/ga4gh/seqcol-spec/.../core_input_schema.json
  - $ref: https://raw.github.com/ga4gh/seqcol-spec/.../topology_schema.json
  - type: object
    properties:
      custom_array:
        type: array
        collated: true
        description: Custom array
        items:
          type: string
required:
  - sequences
  - topologies
inherent:
  - custom_array
```
Where the `core_input_schema.json` is the one now described in the docs, while the `topology_schema.json` could be something like this:
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  topologies:
    type: array
    collated: true
    description: Topology of the sequence
    items:
      type: string
      enum:
        - linear
        - circular
inherent:
  - topologies
```
This adds the pre-defined and standard `topologies` array to the `core_input_schema.json`, in addition to a custom array.
While very useful, this pattern leads to a problem regarding how we specify the `collated` and `inherent` properties. While a JSON Schema validator will logically handle the `required` property automatically as an AND of the `required` lists across the schemas, the `inherent` and `collated` properties must be explicitly parsed and merged by us using some custom logic (probably AND-ing the `inherent` lists, but what if e.g. a property is defined twice with differing values for `collated`?)
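To illustrate that merging problem, here is a minimal sketch (function and attribute names are made up) that combines the custom keywords across already-resolved subschemas and flags conflicting `collated` values:

```python
# Sketch of merging the custom "inherent" and "collated" keywords across
# allOf subschemas, which standard JSON Schema validators will not do.

def merge_custom_keywords(schemas):
    """Union the 'inherent' lists and collect per-property 'collated'
    flags across subschemas, raising on conflicting 'collated' values."""
    inherent = []
    collated = {}
    for schema in schemas:
        for attr in schema.get("inherent", []):
            if attr not in inherent:
                inherent.append(attr)
        for name, prop in schema.get("properties", {}).items():
            if "collated" in prop:
                if name in collated and collated[name] != prop["collated"]:
                    raise ValueError(f"conflicting 'collated' values for {name!r}")
                collated[name] = prop["collated"]
    return {"inherent": inherent, "collated": collated}

# Hypothetical stand-ins for the core and topology subschemas:
core = {
    "properties": {"lengths": {"type": "array", "collated": True}},
    "inherent": ["lengths"],
}
topology = {
    "properties": {"topologies": {"type": "array", "collated": True}},
    "inherent": ["topologies"],
}
merged = merge_custom_keywords([core, topology])
print(merged)
# {'inherent': ['lengths', 'topologies'], 'collated': {'lengths': True, 'topologies': True}}
```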
So this is finally an attempt to write down some thoughts I have regarding a restructuring of the schema. First, some of the problems I see with the current suggestion:
- … the `collection` endpoint) and the input data provided to the `compare` endpoint (`PUT` version).
- … the custom `collated` and `inherent` properties. A consequence of this is that standard JSON Schema libraries/validators might not be able to work. I am not sure whether they will fail when parsing the new properties or just ignore them, and whether that differs between implementations. In any case, standard validators will not be able to take the `collated` and `inherent` properties into consideration when validating that the data corresponds to the schema. One example of such a validation could be to check that if `collated == true` then the lengths of the arrays should be the same. Such seqcol-specific validation will require the implementation and maintenance of custom seqcol-JSON-Schema-compliant validators, and I'm not sure whether we want to provide those.
- … `collated` and `inherent` in the seqcol JSON Schema, as they are not used to validate the data. If we do provide custom validators, we break compliance with regular JSON Schema and create headaches for adopters.
- We would probably want to make e.g. a reference implementation of a seqcol server customizable, allowing the user to specify a schema. Then the server would need to parse the JSON Schema to extract information about which attributes are `collated` and/or `inherent`. This is not what a JSON Schema is made for, and can be problematic to do, as explained in the previous comment:
> While very useful, this pattern leads to a problem regarding how we specify the `collated` and `inherent` properties. So while a JSON Schema validator will logically handle the `required` property automatically as an AND of the `required` lists across the schemas, the `inherent` and `collated` properties must be explicitly parsed and merged by us using some logic (probably AND-ing the inherent lists, but what if e.g. a property is defined twice with differing values for `collated`?)
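The collated-length check mentioned above is an example of validation that would then live outside JSON Schema, as plain code over the parsed seqcol object. A minimal sketch (function and attribute names are illustrative):

```python
# Seqcol-specific validation done outside JSON Schema: all attributes
# flagged as collated must be arrays of the same length.

def check_collated_lengths(seqcol, collated_attrs):
    """Raise ValueError if the collated arrays present differ in length."""
    lengths = {a: len(seqcol[a]) for a in collated_attrs if a in seqcol}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"collated arrays differ in length: {lengths}")

seqcol = {
    "names": ["chr1", "chr2"],
    "lengths": [248956422, 242193529],
}
check_collated_lengths(seqcol, ["names", "lengths"])  # passes silently
```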
So my suggestion is quite simple:

- Do not add `collated` and `inherent` to the JSON Schema at all, to comply with basic JSON Schema for validating data.
- … under the `seqcol` property, but not under `schema`.
- … the `seqcol` part of this, depending on the existence of a general GA4GH schema for the service info. Possibly, the `seqcol` variant of the service info JSON Schema could be an extension of a general GA4GH schema.
- This makes extracting `collated` and `inherent` just regular JSON parsing.
- Any validation taking `collated` and `inherent` into account should be done outside of JSON Schema.

Here's an example ref resolver we wrote in my group:
Just a very brief recap of the essence of the discussion today. I think there are at least three use cases of service info (the `seqcol` part) that need to be taken into consideration:
Basically, I argued for splitting 1. from 2. and 3. by using a plain JSON Schema for validation, and a custom JSON structure for 2. and 3. (with an extra JSON Schema for use with validating that a server complies with our custom JSON structure). The counterargument here is that our custom JSON structure would then end up looking very much like the JSON Schema anyway, especially when allowing for future hierarchical extensions, such as for the pangenome. In practice, there will be a duplication of mostly the same information in two different structures. This is a good counterargument that I don't really have a good response to.
Say then, that we do want to use the same JSON Schema structure for all of 1., 2., and 3. Then there is the issue of whether we should allow `$ref` in the JSON Schema or not.
A related issue that was discussed was how to manage the relationship between the JSON Schema that is provided with the specification – defining the standard attributes, which of these are required, and so on – and a JSON Schema that describes the attributes actually supported by a particular server. For validation (use case 1), one way to manage this is through the use of a `$ref` to the JSON Schema provided by the seqcol standard (which all data in all seqcol-compliant servers must validate against). This is a particularly weighty argument for allowing the use of `$ref`. Another argument is the simplicity of defining the JSON Schema as a true extension (or superset) of standard JSON Schema.
However, if `$ref` is allowed, one would, for use cases 2 and 3, need to make use of a JSON Schema parser that extracts info through `$ref` references (a `$ref` resolver). The fact that such use is really outside the scope of JSON Schema might make this a cumbersome requirement for adopters to manage.
I think that is the gist of it!...
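For what it's worth, a `$ref` resolver along those lines can be sketched in a few lines. This is a toy under stated assumptions: it resolves refs only against a local registry keyed by URI (the URIs below are made up), and a real resolver must also handle JSON Pointers, anchors, remote fetching, and cyclic refs:

```python
# Toy "$ref" resolver: inline every {"$ref": uri} by looking the URI up
# in a local registry of schemas, recursing into nested objects/arrays.
# Cyclic references would recurse forever; real resolvers must guard
# against that.

def resolve_refs(schema, registry):
    """Return a copy of `schema` with each $ref replaced by its
    (recursively resolved) target from `registry`."""
    if isinstance(schema, dict):
        if "$ref" in schema:
            target = registry[schema["$ref"]]  # KeyError if the URI is unknown
            return resolve_refs(target, registry)
        return {k: resolve_refs(v, registry) for k, v in schema.items()}
    if isinstance(schema, list):
        return [resolve_refs(item, registry) for item in schema]
    return schema

# Made-up URI standing in for the seqcol core schema:
registry = {
    "https://example.org/core_input_schema.json": {
        "type": "object",
        "properties": {"lengths": {"type": "array"}},
    },
}
schema = {"allOf": [{"$ref": "https://example.org/core_input_schema.json"}]}
resolved = resolve_refs(schema, registry)
print(resolved["allOf"][0]["properties"]["lengths"])  # {'type': 'array'}
```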
Regarding extending JSON Schema:
https://json-schema.org/draft/2020-12/json-schema-core#section-6.5

> 6.5. Extending JSON Schema
>
> Additional schema keywords and schema vocabularies MAY be defined by any entity. Save for explicit agreement, schema authors SHALL NOT expect these additional keywords and vocabularies to be supported by implementations that do not explicitly document such support. Implementations SHOULD treat keywords they do not support as annotations, where the value of the keyword is the value of the annotation.
>
> Implementations MAY provide the ability to register or load handlers for vocabularies that they do not support directly. The exact mechanism for registering and implementing such handlers is implementation-dependent.
I interpret this to mean that what we're doing to add `inherent` and `collated` is explicitly approved by the JSON Schema spec.
The service-info for sequence collections will need to inherit from this specification, but additional fields can be added. This issue is to discuss which fields should be declared in the seqcol's service info.
For example, we could add the seqcol schema in the service-info.