ga4gh-beacon / specification

GA4GH Beacon specification.
Apache License 2.0

Is mateName missing something? #277

Open teemukataja opened 5 years ago

teemukataja commented 5 years ago

https://github.com/ga4gh-beacon/specification/pull/256 added a new property called mateName as a parameter to a variant query. Is this new feature incomplete? Should mateName be paired up with a coordinate to specify where in the mate chromosome the bonding happens?

Looking at https://samtools.github.io/hts-specs/VCFv4.3.pdf chapter 5.4.4 page 20 for reference.

How would one write a mateName query? We would probably need mateStart, mateStartMin and mateStartMax in addition to the newly created parameter.

Queries could then look something like this, for example: using referenceName, start, mateName and mateStart for 1:1000 - 2:2000, or using variantType for 1:1000 > BND.

mbaudis commented 5 years ago

@teemukataja In the current proposal, mateName specifies the chromosome of the end position. A BND with a specified mateName would correspond to a translocation if the mate is on a different chromosome.

          description: |
            Second chromosome for fusion events. This can be
            * empty (no fusion or unknown partner)
            * identical to `referenceName` (e.g. one side of an inversion)
            * a different chromosome

IMO we don't need a separate mateStart; it is enough to specify that the chromosomes should be ordered (for search):

"reference_name": "8",
"start_min": 128400000,
"start_max": 129400000,
"mate_name": "22",
"end_min": 23250000,
"end_max": 23280000,

(comments also on https://github.com/ga4gh-beacon/specification/pull/256#issuecomment-476106086).
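To make the proposed semantics concrete, here is a minimal Python sketch of how a server might test one stored breakend pair against the range parameters above. The record layout (`reference_name`, `start`, `mate_name`, `end`), the function name, the inclusive range bounds and the sample positions are all assumptions for illustration; none of them come from the Beacon specification.

```python
from typing import Any, Mapping


def matches_fusion_query(variant: Mapping[str, Any], query: Mapping[str, Any]) -> bool:
    """Return True if a stored breakend pair falls inside the query ranges.

    Assumption: `start` lies on `reference_name`, `end` lies on `mate_name`,
    and both ranges are treated as inclusive.
    """
    return (
        variant["reference_name"] == query["reference_name"]
        and variant["mate_name"] == query["mate_name"]
        and query["start_min"] <= variant["start"] <= query["start_max"]
        and query["end_min"] <= variant["end"] <= query["end_max"]
    )


# Query parameters copied from the example above.
query = {
    "reference_name": "8",
    "start_min": 128400000,
    "start_max": 129400000,
    "mate_name": "22",
    "end_min": 23250000,
    "end_max": 23280000,
}

# Hypothetical stored records; the positions are made up for illustration.
hit = {"reference_name": "8", "start": 128750000, "mate_name": "22", "end": 23265000}
miss = dict(hit, mate_name="14")  # same breakpoint on 8, different partner chromosome

print(matches_fusion_query(hit, query))
print(matches_fusion_query(miss, query))
```

With this reading, `end_min`/`end_max` take over the role a separate mateStart range would have played, which is why no extra parameter is needed.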

blankdots commented 5 years ago

@mbaudis Could you provide any example queries (e.g. POST or GET) and responses (JSON response) on how this functionality can be utilised? I could not find any in the issues or in the API specs.

I would also like to validate some assumptions:

mbaudis commented 5 years ago

@teemukataja See the example above, corresponding to an imprecise fusion event (e.g. a MYC-IGL translocation, variant Burkitt lymphoma). A precise query (which doesn't make much sense, since breakpoints rarely recur at exactly the same position):

?referenceName=8&start=1289234404&mateName=22&end=23266044&variantType=BND

This would correspond to 2 lines in VCF, where the corresponding mate would be represented in the ALT and INFO fields:

#CHROM POS ID REF ALT QUAL FILTER INFO
8 1289234404 bnd_A C C]22:23266044] 6 PASS SVTYPE=BND;MATEID=bnd_B
22 23266044 bnd_B A [8:1289234404[A 6 PASS SVTYPE=BND;MATEID=bnd_A
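As a rough illustration of that correspondence, the following Python sketch pulls the mate chromosome and position out of a breakend ALT string and turns one VCF line into the query parameters used above. The function names are hypothetical, the bracket orientation is ignored, and positions are copied unchanged as in this thread's example (no 0-based/1-based conversion is attempted).

```python
import re
from typing import Dict, Optional, Tuple, Union

# Matches the "]chrom:pos]" or "[chrom:pos[" part of a VCF breakend ALT value.
_BND_MATE = re.compile(r"[\[\]]([^\[\]:]+):(\d+)[\[\]]")


def parse_bnd_mate(alt: str) -> Optional[Tuple[str, int]]:
    """Return (mate chromosome, mate position), or None if ALT is not a breakend."""
    match = _BND_MATE.search(alt)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def bnd_record_to_query(chrom: str, pos: int, alt: str) -> Dict[str, Union[str, int]]:
    """Map one breakend VCF line to the query parameters discussed in this thread."""
    mate = parse_bnd_mate(alt)
    if mate is None:
        raise ValueError(f"not a breakend ALT: {alt!r}")
    mate_chrom, mate_pos = mate
    return {
        "referenceName": chrom,
        "start": pos,
        "mateName": mate_chrom,
        "end": mate_pos,
        "variantType": "BND",
    }


# First VCF line from the example above.
print(bnd_record_to_query("8", 1289234404, "C]22:23266044]"))
```

The second VCF line (`bnd_B`) describes the same fusion from the other side, so a server would presumably need to normalise the pair to a single chromosome order before matching.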

The VCF contains additional information about the directionality of the fusion which we don't consider right now (not really important for query models but could be specified later on).

The following would be a typical variation of the query, in which we look for a fusion between canonical breakpoint regions using range matches (same genes):

?referenceName=8&startMin=128400000&startMax=129400000&mateName=22&endMin=23250000&endMax=23280000&variantType=BND
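For clients, a query string like the one above can be assembled with the Python standard library. This is only a sketch: the helper name and its argument layout are assumptions, and the Beacon endpoint URL is deliberately left out because it depends on the deployment.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_bnd_range_query(
    reference_name: str,
    start_range: Tuple[int, int],
    mate_name: str,
    end_range: Tuple[int, int],
) -> str:
    """Build the query string for a range-based BND (fusion) search."""
    params = {
        "referenceName": reference_name,
        "startMin": start_range[0],
        "startMax": start_range[1],
        "mateName": mate_name,
        "endMin": end_range[0],
        "endMax": end_range[1],
        "variantType": "BND",
    }
    # urlencode keeps the insertion order of the dict.
    return "?" + urlencode(params)


print(build_bnd_range_query("8", (128400000, 129400000), "22", (23250000, 23280000)))
```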

Current Beacon responses would just be the standard ones. Since the second example could match multiple fusion events, we could deliver the different matched variants (in some TBD format) in the response (either through handover or in the response message; that is a separate discussion).

mbaudis commented 5 years ago

@teemukataja For BND variant queries w/o a mateName, all types of variants representing a structural sequence disruption could be queried. In our Beacon+ instance, we just match e.g. on the start and end positions of CNV events; obviously BND; possibly INS ...
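One possible reading of this, sketched in Python: a BND query without mateName hits any stored variant that has a breakpoint (its start or its end) inside the queried range. The record layout, function name and sample data are assumptions for illustration and are not taken from the Beacon+ implementation.

```python
from typing import Any, Iterable, List, Mapping


def breakpoint_hits(
    variants: Iterable[Mapping[str, Any]],
    reference_name: str,
    pos_min: int,
    pos_max: int,
) -> List[Mapping[str, Any]]:
    """Return variants with a start or end breakpoint inside the range.

    Assumption: `end` lies on `mate_name` when that key is set, and on
    `reference_name` otherwise (e.g. for CNV events).
    """
    hits = []
    for v in variants:
        end_chrom = v.get("mate_name") or v["reference_name"]
        start_hit = v["reference_name"] == reference_name and pos_min <= v["start"] <= pos_max
        end_hit = end_chrom == reference_name and pos_min <= v["end"] <= pos_max
        if start_hit or end_hit:
            hits.append(v)
    return hits


# Hypothetical stored variants; all positions are made up.
variants = [
    {"variant_type": "DUP", "reference_name": "8", "start": 128000000, "end": 128900000},
    {"variant_type": "DEL", "reference_name": "8", "start": 1000, "end": 2000},
    {"variant_type": "BND", "reference_name": "8", "start": 128750000,
     "mate_name": "22", "end": 23265000},
    {"variant_type": "DEL", "reference_name": "9", "start": 128500000, "end": 128600000},
]

matched = breakpoint_hits(variants, "8", 128400000, 129400000)
print([v["variant_type"] for v in matched])
```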