Open mbaudis opened 7 years ago
Thanks @mbaudis, This is great! It shows how the ga4gh data model can be used to model a wider variety of genomic features in a standard way. In the first iteration, maybe we would add these named fields to the Variant message. But these are annotations given to the alternate_bases, which can be multiply valued in the data model.
For structural variants, in practice, will there be more than one alternate_base being called? If not, we might be safe to add these as fields directly to the variant message. If they are expected to be different for different values of alternate base, we need to provide info on each alternate base message directly.
This is problematic because you are trying to reconcile two arrays with each other in the same message. One option would be to add some complexity the alternate_bases
by providing the info field nested there. This would avoid the confusion of reconciling arrays in the same message. Another options is to make the alternate_bases
array into a singly valued field. This would guarantee that each GA4GH variant message is for a specific sequence modification. In so doing, it becomes more difficult to reconstruct a VCF that has multiply valued alternate_bases, however, I think it is a more clear and robust data model.
Does this data allow us to model fusion variants?
Can we provide a restricted enumeration of values for the svtype
?
Should we enumerate the variant types in general and add them to a message when available?
@david4096 So: Implementing svtype
etc. as 1st class named fields in Variant
(like the current end
)? Or VariantInfo
object as collector for all named variant things beyond basics?
I am slightly in favour of the first option, and keeping info
for odd/possible parameters. But there may be serialisation etc. implications I'm not aware of.
Yes, that question, and also how do we handle the fact that alternate bases can be multiply valued if we are only providing the single sv annotation. I also prefer adding them directly as named fields, if possible
how do we handle the fact that alternate bases can be multiply valued if we are only providing the single sv annotation
@david4096 Not sure what you mean by "multiply valued"? Its just Vars ∩ Calls; so if other values for cipos
, end
..., they are just different variants.
Anyway, SVs are not well suited for optimization through a Vars ∩ Calls approach; they are mostly unique. In our array based collection, we have ~5M calls - representing ~3.5M SVs. And this may be too good already; the base positions of array based variants are technologically bound by the existing probe positions (so segmentation becomes deterministic).
So: From my POV there is no problem to add a first set of SV parameters to Variants
string svtype = 13;
int64 svlen = 14;
repeated int32 cipos = 15;
repeated int32 ciend = 16;
@david4096 So for structural variants (in contrast to small indels), no "alternate_bases" are provided; just the type of alteration (e.g. DUP, DEL); position (= start
) and end (INFO:END in VCF; = end
) and optional confidence intervals for start and end (INFO:CIPOS and INFO:CIEND in VCF).
Below is an example (actually from our test implementation, but with additional info
parameters discussed above):
{
"_id" : ObjectId("58297ca32ca4591e5a0df054"),
"id" : "AM_V_1778741",
"variant_set_id" : "AM_VS_HG18",
"reference_name" : "10"
"start" : 579049,
"end" : 17236099,
"alternate_bases" : "DUP",
"reference_bases" : ".",
"info" : {
"svlen":16657050,
"cipos":[
-1000,
1000
],
"ciend":[
-1000,
1000
]
},
"calls" : [
{
"genotype" : [
".",
"."
],
"call_set_id" : "AM_CS_TCGA-61-1917-01A-01D-0648-01",
"info" : {
"segvalue" : 0.5491
}
}
],
"created" : ISODate("2016-11-14T08:33:58.202Z"),
"updated" : ISODate("2016-11-14T08:33:58.202Z"),
}
The variant data model has the alternate bases as an array. I believe you're suggesting that for variants with svlen
info tags, this alternate bases entry will be empty. It seems confusing to me to put DUP
in the field as printed. Perhaps another field to enumerate variant type is in order? I expect the alternate_bases to refer to a list of bases, or some sequence defined in the metadata.
One other way to ask, in VCFs is it required that svlen, cipos, and ciend occur only once per line? If so then we are probably safe placing it on the message itself.
@david4096 No; here we follow VCF, which would put the type of variant into the ALT
column (e.g. as <DUP>
). Not saying this is optimal (yes, I get the "confusing"); I would prefer a kind of variantType
attribute, as mentioned ...
However, this would lead to the acceptance of alternate_bases
as optional empty. For structural variants (especially talking DUP/DEL) one usually does not provide the sequence being altered itself (imagine whole chromosome duplications ...).
So:
end
attributealternate_bases
(following the VCF "ALT") to describe a subset of non-base/sequence parameters is not optimal (see VCFv4.3 page 15 "alternate_bases
as optional if SV etc.; there are several ways to do this@david4096 This variation of the example
alternate_bases
to an empty listvariant_type
cipos
, ciend
out of info
svlen
in info
, since it seems redundant (¿ special cases with known length but fuzzy position possible ?)Comments?
(Btw.: It seems odd to have "reference_bases" as string & "alternate_bases" as a list.)
{
...
"reference_name" : "10"
"start" : 579049,
"end" : 17236099,
"reference_bases" : ".",
"alternate_bases" : [],
"variant_type":"DUP",
"cipos":[
-1000,
1000
],
"ciend":[
-1000,
1000
],
"info" : {
"svlen":16657050,
},
"calls" : [
...
]
}
I am not a fan of the list in the Alt field either. To working with VCFs, there are many utilities that split multiallelic sites into biallelic records for working with vcf files (see bcftools norm -m -). This is useful for analysis. A lot of programs break when they come across a multiallelic variant. Storing multi-allelic variants a biallelic has been discussed before but I am not sure what the final consensus was for the schema.
Many SV programs indicate whether the call was PRECISE or IMPRECISE. The
precise calls than provide the full ref and alt sequences. IMPRECISE would
put an or in the alt. Programs are getting better at finding
larger and larger precise INDELs so storing the complete REF or ALT can use
a lot of space. For example, Delly when it calls a precise SV that is
greater than 500 bp will store place a in the alt field for a
deletion. The REF string for the deletion then becomes the base before the
event so save space rather than the whole deletion.
On Fri, Dec 9, 2016 at 3:50 AM, Michael Baudis notifications@github.com wrote:
@david4096 https://github.com/david4096 This variation of the example
- changes alternate_bases to an empty list
- adds variant_type
- moves cipos, ciend out of info
- keeps svlen in info, since it seems redundant (¿ special cases with known length but fuzzy position possible ?)
Comments?
(Btw.: It seems odd to have "reference_bases" as string & "alternate_bases" as a list.)
{ ... "reference_name" : "10" "start" : 579049, "end" : 17236099, "reference_bases" : ".", "alternate_bases" : [], "variant_type":"DUP", "cipos":[ -1000, 1000 ], "ciend":[ -1000, 1000 ], "info" : { "svlen":16657050, }, "calls" : [ ... ] }
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/752#issuecomment-265963556, or mute the thread https://github.com/notifications/unsubscribe-auth/AB3rDTJnqiHa_x553HJ_roH22hM5L6E7ks5rGRY5gaJpZM4LGc5w .
-- John Farrell, Ph.D. Biomedical Genetics-Evans 218 Boston University Medical School 72 East Concord Street Boston, MA
ph: 617-638-5491
@jjfarrell (@david4096, @KyleGao, @kozbo ...) Thanks for this comment/background/opinion! Fortunately, this is in line with my take ...
Now this can be decomposed into different points:
alternate_bases
(which has boundary conditions since it will break things)alternate_bases
, or what format to use there if no sequence is provided (this is actually the default)variant_type
attribute, and what values to document thereWe will address 3. ASAP with a PR containing
string variant_type = 13;
int64 svlen = 14;
repeated int32 cipos = 15;
repeated int32 ciend = 16;
... and related documentation. Refactoring to address 1. is left to the code integration people and other interested parties.
I'd be cautious about forbidding multiple alternates in the schema itself. For some (many? most?) analyses it's best to remove them, but in some they're necessary - it's very natural to represent overlapping variants in one donor (say, germline/somatic) using multiple alternates, and a mess otherwise. And there some-if few-genuinely triallelic polymorphisms out there, and genotypes are much simpler to deal with if they're listed with one entry.
(Of course, if you've got multiple alternate alleles at one locus, it would be nicer to have a list of alternate alleles each with the INFO fields, etc specific to that allele, rather than a list of alts, INFOs some of which are by-alt lists, etc. This would also simplify comparison across data sets where one was breakmulti'd and one wasn't.)
I think the reason the breakmulti-vs-don't discussion comes up every so often where bioinformaticians get together is that VCF as it stands isn't brilliant at representing overlapping variants - or any complex deviation from the linear reference genome. And I think that's part of the issue with the SVs, too. I've tried to write a simple ensemble caller from callsets from different SV callers, and it's a mess, as the same event - even if seen by multiple callers - can be expressed a number of different ways (and that's even for simple, non-overlapping SVs). Really, something like an assembly-graph formats (like GFA or FASTG) with INFO and FORMAT like annotations would be a lot cleaner for describing new breakpoint pairings than trying to stuff the calls into a VCF representation.
Would getting input from the GA4GH graph-genome group or the benchmarking group (who have built tools for better comparing complex variants between callers) be useful here? A lot of effort has gone into being able to meaningfully compare different representations of complex variants, and it seems like those insights would be useful in defining the underlying data model being developed here.
Excellent discussion! For this issue, let's close when variant type, svlen, and cipos/end have been added to the variants schema. Please leave your input on whether the proposed representation at the variant level is appropriate and that my concern regarding multiple alts doesn't affect structural variants.
I believe we can move forward with @mbaudis proposal 3.
Further discussion on bi-allelic variants here: https://github.com/ga4gh/schemas/issues/754
This has been taken up in PR https://github.com/ga4gh/schemas/pull/772 .
The VCF specifications express a number of (optional) variant parameters through use of the
INFO
field. This is e.g. necessary for the representation of any structural variant (END
,SVLEN
,CIPOS
,CIEND
). While generally necessary to have a representation of these variants, specifically the Beacon schema is now implementing support for structural variants.Currently, such information in principle can be expressed through the
Variant.info
https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto#L186 field; however, this does not provide any guarantee for proper formats etc.We propose to express data corresponding to the VCF
INFO
field in avariant_info
object. Example implementation below. This is just a first draft, specifically addressing the needs discussed for Beacon development.In contrast to VCF where
END
is expresssed inINFO
, the current variants schema has a specificend
attribute.