ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

Variant info #752

Open mbaudis opened 7 years ago

mbaudis commented 7 years ago

The VCF specifications express a number of (optional) variant parameters through use of the INFO field. This is e.g. necessary for the representation of any structural variant (END, SVLEN, CIPOS, CIEND). While generally necessary to have a representation of these variants, specifically the Beacon schema is now implementing support for structural variants.

Currently, such information in principle can be expressed through the Variant.info https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto#L186 field; however, this does not provide any guarantee for proper formats etc.

We propose to express data corresponding to the VCF INFO field in a variant_info object. Example implementation below. This is just a first draft, specifically addressing the needs discussed for Beacon development.

In contrast to VCF where END is expresssed in INFO, the current variants schema has a specific end attribute.

message VariantInfo {
  string svtype = 1;
  int64 svlen = 2;
  repeated int32 cipos = 3;
  repeated int32 ciend = 4; 
}
david4096 commented 7 years ago

Thanks @mbaudis, This is great! It shows how the ga4gh data model can be used to model a wider variety of genomic features in a standard way. In the first iteration, maybe we would add these named fields to the Variant message. But these are annotations given to the alternate_bases, which can be multiply valued in the data model.

For structural variants, in practice, will there be more than one alternate_base being called? If not, we might be safe to add these as fields directly to the variant message. If they are expected to be different for different values of alternate base, we need to provide info on each alternate base message directly.

This is problematic because you are trying to reconcile two arrays with each other in the same message. One option would be to add some complexity the alternate_bases by providing the info field nested there. This would avoid the confusion of reconciling arrays in the same message. Another options is to make the alternate_bases array into a singly valued field. This would guarantee that each GA4GH variant message is for a specific sequence modification. In so doing, it becomes more difficult to reconstruct a VCF that has multiply valued alternate_bases, however, I think it is a more clear and robust data model.

Does this data allow us to model fusion variants?

Can we provide a restricted enumeration of values for the svtype?

Should we enumerate the variant types in general and add them to a message when available?

mbaudis commented 7 years ago

@david4096 So: Implementing svtype etc. as 1st class named fields in Variant (like the current end)? Or VariantInfo object as collector for all named variant things beyond basics?

I am slightly in favour of the first option, and keeping info for odd/possible parameters. But there may be serialisation etc. implications I'm not aware of.

david4096 commented 7 years ago

Yes, that question, and also how do we handle the fact that alternate bases can be multiply valued if we are only providing the single sv annotation. I also prefer adding them directly as named fields, if possible

mbaudis commented 7 years ago

how do we handle the fact that alternate bases can be multiply valued if we are only providing the single sv annotation

@david4096 Not sure what you mean by "multiply valued"? Its just Vars ∩ Calls; so if other values for cipos, end ..., they are just different variants.

Anyway, SVs are not well suited for optimization through a Vars ∩ Calls approach; they are mostly unique. In our array based collection, we have ~5M calls - representing ~3.5M SVs. And this may be too good already; the base positions of array based variants are technologically bound by the existing probe positions (so segmentation becomes deterministic).

So: From my POV there is no problem to add a first set of SV parameters to Variants

  string svtype = 13;
  int64 svlen = 14;
  repeated int32 cipos = 15;
  repeated int32 ciend = 16;
mbaudis commented 7 years ago

@david4096 So for structural variants (in contrast to small indels), no "alternate_bases" are provided; just the type of alteration (e.g. DUP, DEL); position (= start) and end (INFO:END in VCF; = end) and optional confidence intervals for start and end (INFO:CIPOS and INFO:CIEND in VCF).

Below is an example (actually from our test implementation, but with additional info parameters discussed above):

{
    "_id" : ObjectId("58297ca32ca4591e5a0df054"),
    "id" : "AM_V_1778741",
    "variant_set_id" : "AM_VS_HG18",
    "reference_name" : "10"
    "start" : 579049,
    "end" : 17236099,
    "alternate_bases" : "DUP",
    "reference_bases" : ".",
    "info" : {
          "svlen":16657050,
          "cipos":[
              -1000,
              1000
          ],
          "ciend":[
              -1000,
              1000
          ]
       },
    "calls" : [
        {
            "genotype" : [
                ".",
                "."
            ],
            "call_set_id" : "AM_CS_TCGA-61-1917-01A-01D-0648-01",
            "info" : {
                "segvalue" : 0.5491
            }
        }
    ],
    "created" : ISODate("2016-11-14T08:33:58.202Z"),
    "updated" : ISODate("2016-11-14T08:33:58.202Z"),
}
david4096 commented 7 years ago

The variant data model has the alternate bases as an array. I believe you're suggesting that for variants with svlen info tags, this alternate bases entry will be empty. It seems confusing to me to put DUP in the field as printed. Perhaps another field to enumerate variant type is in order? I expect the alternate_bases to refer to a list of bases, or some sequence defined in the metadata.

One other way to ask, in VCFs is it required that svlen, cipos, and ciend occur only once per line? If so then we are probably safe placing it on the message itself.

mbaudis commented 7 years ago

@david4096 No; here we follow VCF, which would put the type of variant into the ALT column (e.g. as <DUP>). Not saying this is optimal (yes, I get the "confusing"); I would prefer a kind of variantType attribute, as mentioned ... However, this would lead to the acceptance of alternate_bases as optional empty. For structural variants (especially talking DUP/DEL) one usually does not provide the sequence being altered itself (imagine whole chromosome duplications ...).

So:

mbaudis commented 7 years ago

@david4096 This variation of the example

Comments?

(Btw.: It seems odd to have "reference_bases" as string & "alternate_bases" as a list.)

{
...
    "reference_name" : "10"
    "start" : 579049,
    "end" : 17236099,
    "reference_bases" : ".",
    "alternate_bases" : [],
        "variant_type":"DUP",
        "cipos":[
           -1000,
           1000
        ],
        "ciend":[
           -1000,
           1000
        ],
    "info" : {
          "svlen":16657050,
        },
    "calls" : [
...
        ]
}
jjfarrell commented 7 years ago

I am not a fan of the list in the Alt field either. To working with VCFs, there are many utilities that split multiallelic sites into biallelic records for working with vcf files (see bcftools norm -m -). This is useful for analysis. A lot of programs break when they come across a multiallelic variant. Storing multi-allelic variants a biallelic has been discussed before but I am not sure what the final consensus was for the schema.

Many SV programs indicate whether the call was PRECISE or IMPRECISE. The precise calls than provide the full ref and alt sequences. IMPRECISE would put an or in the alt. Programs are getting better at finding larger and larger precise INDELs so storing the complete REF or ALT can use a lot of space. For example, Delly when it calls a precise SV that is greater than 500 bp will store place a in the alt field for a deletion. The REF string for the deletion then becomes the base before the event so save space rather than the whole deletion.

On Fri, Dec 9, 2016 at 3:50 AM, Michael Baudis notifications@github.com wrote:

@david4096 https://github.com/david4096 This variation of the example

  • changes alternate_bases to an empty list
  • adds variant_type
  • moves cipos, ciend out of info
  • keeps svlen in info, since it seems redundant (¿ special cases with known length but fuzzy position possible ?)

Comments?

(Btw.: It seems odd to have "reference_bases" as string & "alternate_bases" as a list.)

{ ... "reference_name" : "10" "start" : 579049, "end" : 17236099, "reference_bases" : ".", "alternate_bases" : [], "variant_type":"DUP", "cipos":[ -1000, 1000 ], "ciend":[ -1000, 1000 ], "info" : { "svlen":16657050, }, "calls" : [ ... ] }

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/752#issuecomment-265963556, or mute the thread https://github.com/notifications/unsubscribe-auth/AB3rDTJnqiHa_x553HJ_roH22hM5L6E7ks5rGRY5gaJpZM4LGc5w .

-- John Farrell, Ph.D. Biomedical Genetics-Evans 218 Boston University Medical School 72 East Concord Street Boston, MA

ph: 617-638-5491

mbaudis commented 7 years ago

@jjfarrell (@david4096, @KyleGao, @kozbo ...) Thanks for this comment/background/opinion! Fortunately, this is in line with my take ...

Now this can be decomposed into different points:

  1. getting rid of list in alternate_bases (which has boundary conditions since it will break things)
  2. accepting empty alternate_bases, or what format to use there if no sequence is provided (this is actually the default)
  3. the implementation of a separate variant_type attribute, and what values to document there

We will address 3. ASAP with a PR containing

  string variant_type = 13;
  int64 svlen = 14;
  repeated int32 cipos = 15;
  repeated int32 ciend = 16;

... and related documentation. Refactoring to address 1. is left to the code integration people and other interested parties.

ljdursi commented 7 years ago

I'd be cautious about forbidding multiple alternates in the schema itself. For some (many? most?) analyses it's best to remove them, but in some they're necessary - it's very natural to represent overlapping variants in one donor (say, germline/somatic) using multiple alternates, and a mess otherwise. And there some-if few-genuinely triallelic polymorphisms out there, and genotypes are much simpler to deal with if they're listed with one entry.

(Of course, if you've got multiple alternate alleles at one locus, it would be nicer to have a list of alternate alleles each with the INFO fields, etc specific to that allele, rather than a list of alts, INFOs some of which are by-alt lists, etc. This would also simplify comparison across data sets where one was breakmulti'd and one wasn't.)

I think the reason the breakmulti-vs-don't discussion comes up every so often where bioinformaticians get together is that VCF as it stands isn't brilliant at representing overlapping variants - or any complex deviation from the linear reference genome. And I think that's part of the issue with the SVs, too. I've tried to write a simple ensemble caller from callsets from different SV callers, and it's a mess, as the same event - even if seen by multiple callers - can be expressed a number of different ways (and that's even for simple, non-overlapping SVs). Really, something like an assembly-graph formats (like GFA or FASTG) with INFO and FORMAT like annotations would be a lot cleaner for describing new breakpoint pairings than trying to stuff the calls into a VCF representation.

Would getting input from the GA4GH graph-genome group or the benchmarking group (who have built tools for better comparing complex variants between callers) be useful here? A lot of effort has gone into being able to meaningfully compare different representations of complex variants, and it seems like those insights would be useful in defining the underlying data model being developed here.

david4096 commented 7 years ago

Excellent discussion! For this issue, let's close when variant type, svlen, and cipos/end have been added to the variants schema. Please leave your input on whether the proposed representation at the variant level is appropriate and that my concern regarding multiple alts doesn't affect structural variants.

I believe we can move forward with @mbaudis proposal 3.

Further discussion on bi-allelic variants here: https://github.com/ga4gh/schemas/issues/754

mbaudis commented 7 years ago

This has been taken up in PR https://github.com/ga4gh/schemas/pull/772 .