ga4gh-beacon / specification

GA4GH Beacon specification.
Apache License 2.0
32 stars 25 forks source link

Add mateName for fusions (issue #255) #256

Closed sdelatorrep closed 5 years ago

sdelatorrep commented 5 years ago

New PR to replace #176

I've created a new type called MateName because when using $ref in some locations of the definition, the description cannot be overridden and it would be the same as in type Chromosome.

        - name: mateName
          description: |
            Second chromosome for fusion events. This can be
            * empty (no fusion or unknown partner)
            * identical to `referenceName` (e.g. one side of an inversion)
            * a different chromosome
          in: query
          required: false
          schema:
            $ref: '#/components/schemas/MateName'

vs

components:
  schemas:
        ...
        mateName:
          $ref: '#/components/schemas/MateName'
mbaudis commented 5 years ago

@sdelatorrep but why does mateName need a separate component & cannot reference to Chromosome? So the description of Chromosome in components would be neutral/minimal, and only query.parameters.referenceName and query.parameters.mateName have a description of the usage. A mateName object is just a chromosome (this would stay the same concept if we use other types of references).

The component could then just be:

    Chromosome:
      description: 'A chromosome of a reference assembly, to which base coordinates are being referenced.'

(Also getting rid of the multiple definitions of chromosome names, which lead to inconsistencies - the component's enum values are sufficient.)

mbaudis commented 5 years ago

@mcupak Thanks for the comments!

mbaudis commented 5 years ago

@mcupak ... regarding the naming: mateName follows referenceName style. Seems logical; however, I'm not married to the name.

sdelatorrep commented 5 years ago

Hi @mcupak, @mbaudis , it's not possible to do what you are suggesting. You cannot add the description here:

        mateName:
          $ref: '#/components/schemas/MateName'

because it is ignored (description or anything else, see this). This is what I tried to say in my first comment in this PR and this is why I created a new class and didn't reuse the Chromosome one. If we reuse the Chromosome class we will have 2 different descriptions for the field mateName: image and image (it's not possible to change the second one)

mbaudis commented 5 years ago

@sdelatorrep @mcupak So a parameter cannot have a description, if it uses a reference? Ditch Swagger.

Seriously, one has to be able to comment on the structure (i.e. in the component; here that the enum describes chromosomes) and the specific use (i.e. in the parameter, here describing what the chromosomes are being used for). We already stumble upon this, here e.g. in the multiple declarations of what a chromosome/reference ... is.

But if I look at the Swagger example, the description sits with the parameter (there responses.200.description, not the component.

So the way here would be just to have a component Chromosome, which is both referred to by referenceName and mateName, which have their (unique) description values?!

jrambla commented 5 years ago

IMHO this is a "tooling" problem (OpenAPI in this case) and we should look for the less bad option. Thus, in order to refine the scenario and do a more informed decision, let me add another question: @mbaudis are we solid that mateName would be identical to Chromosome, e.g. would we allow for using "MT" in that value?

mbaudis commented 5 years ago

@jrambla Correct interpretation (re: tooling). Yes, there shouldn't be a difference between mate name and reference. I have to admit that I haven't seen MT fusions, but logically there is no reason why they shouldn't be described (reference and mate can be the same, as in the description). So for now, I would prefer different parameters with their own descriptions using the same #ref , instead of having a #ref for each parameter (which totally defies the reason why we use components).

sdelatorrep commented 5 years ago

I'm not sure I understand you, @mbaudis. If we keep the same $ref we will have different descriptions, depending on where you look at:

  1. the definition of mateName in the query parameters of the GET /query endpoint (1st screenshot in my previous comment), or
  2. the definition of mateName in BeaconAlleleRequest class (used in the POST /query endpoint) and we cannot change the description in this case (2nd screenshot in my previous comment).

Is this what you're saying? Or I missed/misunderstood something? Thanks!

mbaudis commented 5 years ago

@sdelatorrep I actually overlooked the second definition... But I guess there is no way around for having both referenceName and mateName described in both... The alternative would be to have no description at all for those in POST & GET (i.e. only

referenceName:
  $ref: '#/components/schemas/Chromosome'
mateName:
  $ref: '#/components/schemas/MateName'

), and have the descriptions with the (separate) components (as you seem to suggest).

While both options solve a different duplication, my problem with the 2nd is that the Chromosome and MateName components are conceptually the same (i.e. chromosome instances).

But I leave the decision to you..

(Btw: the Chromosome component should be renamed to ReferenceName whenever feasible; there are cases where other reference annotations are used - e.g. VMC uses sequence identifiers...)

sdelatorrep commented 5 years ago

My proposal:

      parameters:
        - name: referenceName
          description: 'Reference name.'
          in: query
          required: true
          schema:
            $ref: '#/components/schemas/ReferenceName'
        - name: mateName
          description: |
            Second chromosome for fusion events. This can be
            * empty (no fusion or unknown partner)
            * identical to `referenceName` (e.g. one side of an inversion)
            * a different chromosome
          in: query
          required: false
          schema:
            $ref: '#/components/schemas/ReferenceName'
...
components:
  schemas:
    ReferenceName:
      description: 'Reference name.'
      type: string
      enum:
        - ...
...
    BeaconAlleleRequest:
...
      properties:
        referenceName:
          $ref: '#/components/schemas/ReferenceName'
        mateName:
          $ref: '#/components/schemas/ReferenceName'

And it would look like this in an OpenAPI viewer:

In the second defintion of mateName, the description is different but, at least, generic... don't know if it's the best choice, though, but the only alternative is duplicating the component which has the drawbacks @mbaudis exposed. Let's see what @mcupak says about this now that it's clear what the problem is.

mbaudis commented 5 years ago

@sdelatorrep +1 in this form :-)

jrambla commented 5 years ago

Ok. Let me state that eventually, being this an spec, I will prefer clarity (two components with clear descriptions) over being concise (one component with not so clear descriptions or less readable ones). But, given that we have not reached a consensus yet, let's do this in a parsimonious way, changing one step at a time (release).

1) As we only have a component now (i.e. Chromosome), let's use only one. As per Sabela's last proposal. 2) BUT, as it is already named "Chromosome", and I don't want to introduce changes one at a time, and I believe that using referenceName in 3 different places is quite confusing, I'll not change the name right now. 3) If that solution of sharing the component proves not clear enough to the community, we should go in the direction of having two components instead of one. We will keep "Chromosome" in the "old spec" and use "referenceName" and "mateName" in the new 4) I will ask Sabela to change "mateName" (not clear enough IMHO) to "mateChromosomeName" (I'll like to use "referenceChromosomeName" also, but "referenceName" has been there from the beginning for reasons that I don't really understand).

Hoping that this reaches consensus enough and allows us to move forward.

mbaudis commented 5 years ago

@jrambla @sdelatorrep The problem with using "chromosome" (anywhere) is that chromosomes are just a subset of the set of possible genome references. The concept is already being stretched by "MT"; but - importantly - many coordinate systems use positions relative to another reference (whatever clone, transcript ...id). The VMC specification explicitly refrains from using chromosome, but states

sequence_id: An id mapping to the Identifier of the external database Sequence

In VCF it is called CHROM, but may be used differently - a scenario not being served from Beacon right now:

CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (“<ID>”) pointing to a contig in the assembly file

Now, I much prefer the "chromosome from a reference genome edition" way to express coordinates, with resources taking care of the mappings. But something like the semantic abuse in VCF would drive me mad. So either we really say "Beacon is for reference genome mapped data, so there be chromosomes", or we try to get rid of the "chromosome" concept in attribute names etc. and do something like

"referenceName" :
  "description" : |
    The identifier of a genomic sequence assembly to which coordinate positions refer. It is recommended to use chromosomes of a reference genome assembly."
...
mbaudis commented 5 years ago

@jrambla @sdelatorrep ... so if going the "chromosomes only" way, "mateChromosomeName" would be an (expressive) option. But it's a lock-in into "chromosome space" you'll never get rid of.

garysaunders commented 5 years ago

I have similar concerns about this as expressed by @mbaudis. With the new Use Cases in Plants and Proteomics being worked on I would like us not to tie into chromosomes only.

jrambla commented 5 years ago

@mbaudis, @garysaunders we are currently tied to human chromosomes, as we are using an enum. Neither contigs, nor any other "artifact" would accepted. Thus, we can't honor your concerns without reconsidering the whole "reference system" in the Beacon. Indeed, we should probably look at if refget "principals" are applicable here. Beacon v2,0 could be a good target for that. I'm not against that, but as I've said, lets introduce changes step by step. Lets not block the spec release and let it evolve.

mbaudis commented 5 years ago

@jrambla @garysaunders @sdelatorrep So neither the use of "chromosome" nor the enum (not a good idea, but ...) will block non-human use cases ; it will just lead to painful "please give me "Z" etc. issues. Proteins will need a different solution anyway (other request type, I guess).

So best compromise IMO to go ahead with mateName (similar to referenceName; more verbosity isn't really necessary since description etc.), and tie this for now to the Chromosome component. This will keep the general syntax in line; if references become something very different, this may then later be added as either new attributes or different components.

But if @jrambla really wants mateChromosomeName he can have it :-)

garysaunders commented 5 years ago

+1 to @mbaudis "So best compromise IMO to go ahead with mateName (similar to referenceName; more verbosity isn't really necessary since description etc.), and tie this for now to the Chromosome component."

sdelatorrep commented 5 years ago

Hi @garysaunders ! This PR is stalled since quite a long time waiting for Michael and others to review it. Please, could you help to chase them? Thanks!

mbaudis commented 5 years ago

Ouch. Done.

teemukataja commented 5 years ago

@mbaudis @sdelatorrep

Now that we have a mateNameproperty, do we also require a mateStart property (and mateStartMin and mateStartMax to cover a wider area)?

Looking at https://samtools.github.io/hts-specs/VCFv4.3.pdf chapter 5.4.4 page 20.

A fusion variant should probably have the mate's coordinate as well.

How would one search for a fusion variant? What search terms are required? e.g., referenceName, start, mateName, mateStart for 1 : 1000 - 2 : 2000 or variantType in 1 : 1000 > BND for any mate