ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Multiple alts per variant? #169

Closed brendanofallon closed 9 years ago

brendanofallon commented 9 years ago

I'm curious to hear people's opinions on whether a variant can or should encompass multiple alt alleles. It seems that version 0.5 of variants.avdl supports multiple alts per variant, as alt is an array of strings and not a single string. FWIW, I think a variant having multiple alts is a bit confusing and leads to some unnecessary complexity. For instance, each alt may overlap different features of interest, will have different annotations, may be present in distinct sets of samples, etc. This leads to some confusing requirements: for the set of annotations associated with a multi-alt variant, only some annotations will apply to each alt allele, so the annotation set would require some sort of mapping back to which alt(s) each annotation is tied to. An alternative would be to represent each alt as a separate variant having the same start position, and restrict the idea of a variant to being a single chr-pos-ref-alt combination.
I realize that it may be a bit late in the game to question this, and that VCFs of course permit multiple alts per variant so it's something people are familiar with... but I do think that it makes life more confusing in several ways, and I'm not sure what the upside is.

lh3 commented 9 years ago

@richarddurbin also advocates the mono-allelic model. I still have a few questions. Say we have a VCF line REF=A;ALT=C,T;GT=1/2. Is there a mono-allelic VCF? If so, what is it? REF=A;ALT=C;GT=1/1 and REF=A;ALT=T;GT=1/1? Is the following valid: REF=A;ALT=C;GT=0/1 and REF=A;ALT=T;GT=0/1?

brendanofallon commented 9 years ago

Good point... I'm not sure how, if at all, the single-alt model would work with VCF. If a goal is to make the API as close as possible to VCF, then a single-alt strategy might not work. I guess in my view a 'genotype' is not really a property of a variant; it's a label given to a reference position after one has examined all of the variation that spans the site. In the example you give above, one would have to ask which variants exist at a site, then infer a genotype from the result. Usually this would be trivial, but it could require some work if there are multiple complex variations around. I think it's worth noting that we already must do this for somatic variation, where there's no clearly defined 'genotype'.

mlinderm commented 9 years ago

A colleague forwarded this thread to me… I advocated for the mono-allelic model in the context of the ADAM project for many of the reasons listed above. Not necessarily for use during variant calling, but instead for downstream analysis and particularly for joining with other datasets.

In the example above, I think one could use local phasing and possibly the symbolic allele to maintain the same information after "splitting" multi-allelic entries. For example

REF=A;ALT=C,T;GT=1/2

would become something like

REF=A;ALT=C,<NON_REF>;GT=1|2;PS=someUniqueID
REF=A;ALT=T,<NON_REF>;GT=2|1;PS=someUniqueID

The result captures that the two different alleles are in trans and that the other allele, although not explicitly specified, is not the reference. While adding <NON_REF> would seem to only perpetuate the multiple alleles, I think <NON_REF> could and should be a special case represented directly in any schema and not treated as a true allele.
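The splitting scheme sketched above can be written out as a short Python sketch. This is purely illustrative: the record fields (`ref`, `alts`, `gt`, `ps`) and the helper name are assumptions for this example, not part of any GA4GH or VCF library, and the phase-set ID is just a generated token.

```python
# Illustrative sketch: split a multi-allelic record into per-alt records,
# linking them with a shared phase set ID and a <NON_REF> placeholder.
# Field names and the helper are hypothetical, not from any real schema.
import uuid

def split_multiallelic(ref, alts, gt):
    """gt is a pair of 0-based allele indices (0 = ref)."""
    ps = uuid.uuid4().hex  # shared phase set ties the split records together
    records = []
    for i, alt in enumerate(alts, start=1):
        # In each split record, a haplotype carries this alt (index 1),
        # the reference (index 0), or some other allele (<NON_REF>, index 2).
        phased = []
        for hap in gt:
            if hap == i:
                phased.append(1)        # this record's alt
            elif hap == 0:
                phased.append(0)        # reference
            else:
                phased.append(2)        # a different alt -> <NON_REF>
        records.append({"ref": ref, "alts": [alt, "<NON_REF>"],
                        "gt": "|".join(map(str, phased)), "ps": ps})
    return records

recs = split_multiallelic("A", ["C", "T"], (1, 2))
# First record: ALT=C,<NON_REF> with GT=1|2; second: ALT=T,<NON_REF> with GT=2|1
```

Running this on the REF=A;ALT=C,T;GT=1/2 example reproduces the two split records shown above, with both records sharing the same PS value.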

richarddurbin commented 9 years ago

I do advocate a mono-allelic representation, but this is not the same as having a VCF file with one ALT per record. That would be a bi-allelic representation. VCF always requires a ref allele and at least one alt allele. The mono-allelic representation of

Say we have a VCF line REF=A;ALT=C,T;GT=1/2. Is there a mono-allelic VCF? If so, what is it? REF=A;ALT=C;GT=1/1 and REF=A;ALT=T;GT=1/1?

would have three alleles, something like

START=1, END=2, SEQ=A, INFO=REF, COUNT[sample1]=0
START=1, END=2, SEQ=C, COUNT[sample1]=1
START=1, END=2, SEQ=T, COUNT[sample1]=1

Note that this is not VCF, that [START, END) is a semi-open interval, that there is a record for each allele, and that the genotype is given by the allele count of each allele in the sample. External code needs to decide which alleles are mutually exclusive to determine genotypes, when we want them. This is along the lines of the comments from @brendanofallon. My view now is that we actually need traditional genotypes much less than people currently think. We really want the allele count information, which is the copy number of this allele in the sample. Note the "sample" can potentially be diploid, or haploid (i.e. a phased haplotype).
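The layout above can be sketched in a few lines of Python. All names here are illustrative assumptions (this is not a GA4GH or VCF API): one record per allele, with the genotype replaced by a per-sample copy number of that allele.

```python
# Rough sketch of the mono-allelic layout: one record per allele,
# genotype expressed as per-sample copy number of that allele.
# Field and function names are hypothetical, for illustration only.
from collections import Counter

def to_allele_counts(start, ref, alts, genotypes):
    """genotypes: {sample: (i, j)} with 0-based allele indices (0 = ref)."""
    alleles = [ref] + alts
    records = []
    for idx, seq in enumerate(alleles):
        rec = {"start": start,
               "end": start + len(ref),   # [start, end) semi-open interval
               "seq": seq,
               "is_ref": idx == 0,
               "count": {}}
        for sample, gt in genotypes.items():
            # Copy number of this allele in the sample's genotype.
            rec["count"][sample] = Counter(gt)[idx]
        records.append(rec)
    return records

recs = to_allele_counts(1, "A", ["C", "T"], {"sample1": (1, 2)})
# Yields counts 0 (REF=A), 1 (C), 1 (T) for sample1, as in the three records above
```

For REF=A;ALT=C,T;GT=1/2 this reproduces the three records above: count 0 for the reference allele and count 1 for each of C and T.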

As @bioinformed pointed out, the VCF spec does not require that overlapping records are merged. There are multiple reasons for this. First, the rules for merging can be complex, especially with long deletion variants, and people may not like short variants and long variants to be combined and lose their identity. Second, for some purposes such as imputation, it turns out that making everything bi-allelic in VCF (not mono-allelic!) is required by some software, and also improves performance even when the software allows multi-allelic sites. However, in my view, if you don't merge the records then you can get into trouble of the sort that the exchange between @lh3, @bioinformed and myself illustrated. For me this is the core problem with VCF, which at the moment we fudge.

Richard


lh3 commented 9 years ago

Firstly, how would we encode a VCF line POS=10;REF=CT;ALT=C,CA;GT1=0/0;GT2=1/2;GT3=0/2? Would it be

START=10;END=12;SEQ=CT;INFO=REF;CNT1=2;CNT2=0;CNT3=1
START=10;END=12;SEQ=C;CNT1=0;CNT2=1;CNT3=0
START=11;END=12;SEQ=T;INFO=REF;CNT1=2;CNT2=0;CNT3=1
START=11;END=12;SEQ=A;CNT1=0;CNT2=1;CNT3=1

Or perhaps there would be only one reference line? Either way, when we want to get the genotypes, how could we combine the alleles? By [START,END) pair or by having a new object relating alleles? Wouldn’t this bring us back to the mess with VCF and make it worse? With the mono-allelic model, we delay part of the merging problem to the genotyping phase. We have not solved it.

Secondly, for a complex variant that can be represented in different ways in VCF, the mono-allelic model is not easier for merging. We still need to go through a realignment procedure to compare alleles. Even if we ignore genotyping, the mono-allelic model is not a full solution to merging.

Thirdly, how could we keep genotype likelihoods? It would be cumbersome to store them in the mono-allelic model. Or perhaps we want to get rid of them in GA4GH as they are mostly useful for calling/genotyping only?

Generally, I think mono- and multi-allelic models are equivalent at the core. They share the same fundamental problems with edit-based representations. They have the same difficulty in merging when we want to get genotypes. At a higher level, the mono-allelic model is more consistent with annotation as we annotate the function of an allele, usually not a genotype. However, the multi-allelic model of VCF is more convenient when we deal with genotypes. It is also more consistent provided that there are no overlapping VCF records.

IMO, the mono-allelic model would be more useful for a context- or sequence-based representation.

richarddurbin commented 9 years ago

I think I would just have

START=11;END=12;SEQ=T;INFO=REF;CNT1=2;CNT2=0;CNT3=1
START=11;END=12;SEQ=;CNT1=0;CNT2=1;CNT3=0
START=11;END=12;SEQ=A;CNT1=0;CNT2=1;CNT3=1

Note that in this version, as I had it in the Avro at least, I allowed empty sequences for deletions and other joins. But that isn't really the point: one could have two overlapping alternative variants deriving from the reference in different places, which would give multiple REF fields.

You are right that we delay the merging until making genotypes, but at that point we only need consider allele fields which have count > 0. This eliminates most rare alleles, which eliminates the problem with merging across individuals leading to infinite grouping of sites. Secondly, at each position there are at most two alleles that are relevant, by definition, which makes all the algorithms much simpler. So I think we have effectively solved the merging problem, by keeping it bounded.

Your second point about identifiability is valid. This is still an edit representation and we still need a canonical way to define the edits (alleles), unless we transform back into sequence and align. The only win here is that because we are not combining alleles we can define the substrate and the edit in the simplest way possible for each allele, rather than having to share the substrate with other alleles at the same site. I use the word substrate not reference because I would like to define new variant alleles on top of existing ones, e.g. SNPs in insertions.

The third point about genotype likelihoods was raised by Gabor and Gil. We can keep allele count likelihoods, i.e. the probability of seeing the data given that there are 0, 1, 2,... copies of the allele. I convinced myself that this captures what we need. There isn't yet a demonstration of that, but I am talking about this just now - perhaps more on today's call. One way to help think about this is that when there are lots of alleles at a VCF site, say N, then there are N(N+1)/2 genotypes, but there are not that many independent degrees of freedom in the likelihoods.

Richard
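The counting argument in the last point above can be checked with simple arithmetic. This is a toy illustration, not any GA4GH API: with N alleles at a site there are N(N+1)/2 unordered diploid genotypes, whereas storing per-allele copy-number likelihoods (0, 1 or 2 copies for a diploid sample) needs only 3 values per allele.

```python
# Toy arithmetic behind the degrees-of-freedom argument above.
def n_genotypes(n_alleles):
    """Unordered diploid genotypes at a site with n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2

def n_count_likelihoods(n_alleles, ploidy=2):
    """Per-allele copy-number likelihoods: 0..ploidy copies of each allele."""
    return n_alleles * (ploidy + 1)

for n in (2, 4, 10):
    print(n, n_genotypes(n), n_count_likelihoods(n))
# For 10 alleles: 55 genotypes, but only 30 allele-count likelihood values
```

The gap widens quadratically with the number of alleles, which is the sense in which the genotype likelihoods carry fewer independent degrees of freedom than their count suggests.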


fnothaft commented 9 years ago

> The third point about genotype likelihoods was raised by Gabor and Gil. We can keep allele count likelihoods, i.e. the probability of seeing the data given that there are 0, 1, 2,... copies of the allele. I convinced myself that this captures what we need. There isn't yet a demonstration of that, but I am talking about this just now - perhaps more on today's call. One way to help think about this is that when there are lots of alleles at a VCF site, say N, then there are N(N+1)/2 genotypes, but there are not that many independent degrees of freedom in the likelihoods.

I am +1 here. Also, if you have the allele count likelihoods, can you eliminate the need for a symbolic alt allele, as is used in gVCF? I believe that you can, but I haven't gone through the math and convinced myself yet.

brendanofallon commented 9 years ago

I agree that there's a tradeoff between ease-of-genotyping and ease-of-annotation for the multi- vs. mono-allelic models. But I think the genotyping burden is relatively minor in mono-allelic models, especially if allele counts and/or likelihoods are stored with each variant. On the other hand, I think the additional burden on downstream analysis is pretty significant for multi-allelic models: it's not just annotations that are more difficult, but also set operations like intersection and union, and counting and frequency calculations (how many samples have variant X?); even iteration over the set of variants is non-trivial. In house, we always convert VCFs to a mono-allelic representation, and it's been successful for us so far. Of course, we do mostly downstream analyses, and we don't write variant callers, so my opinion is probably biased by the type of work I do.

lh3 commented 9 years ago

On the third point, how do we calculate "0 copies of the allele"? What is the formula? Anyway, I haven't worked out how to compute the likelihood of a het P(D|[A1,A2]) from some form of P(D|A1) and P(D|A2) in the general case.

I see your point on bounded merging. It is a valid and good point. On second thought, I think we should drop all reference alleles. For the example above, we can keep it as:

START=11;END=12;SEQ=;CNT1=2,0;CNT2=0,1;CNT3=1,0
START=11;END=12;SEQ=A;CNT1=2,0;CNT2=0,1;CNT3=1,1

In CNTx=r,a, r is the number of reference alleles and a is the number of ALT alleles. CNTx needs to be consistent across overlapping multi-allelic records (biallelic is okay), but if we go for the mono-allelic model, we need to introduce this unfortunate dependency anyway.
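The paired-count idea can be sketched as follows. The helper name and record shape are assumptions for illustration: each ALT record stores, per sample, the pair (ref copies, alt copies) derived from a genotype, so no separate reference record is needed.

```python
# Illustrative sketch of the CNTx=r,a pairing above; names are hypothetical.
def paired_counts(gt, alt_index, ploidy=2):
    """gt: tuple of 0-based allele indices; returns (r, a) for one ALT allele."""
    r = sum(1 for h in gt if h == 0)          # copies of the reference allele
    a = sum(1 for h in gt if h == alt_index)  # copies of this ALT allele
    return (r, a)

# POS=10;REF=CT;ALT=C,CA with GT1=0/0, GT2=1/2, GT3=0/2:
gts = {"CNT1": (0, 0), "CNT2": (1, 2), "CNT3": (0, 2)}
counts = {s: paired_counts(gt, alt_index=2) for s, gt in gts.items()}
# ALT "CA" (index 2): CNT1=(2,0), CNT2=(0,1), CNT3=(1,1)
```

Applied to the earlier POS=10 example, this reproduces the pairs on the SEQ=A line above. Note that for samples carrying a different overlapping ALT, r alone does not determine the remaining allele, which is exactly the consistency dependency mentioned.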

@brendanofallon The genotyping problem is not that minor. In the mono-allelic model, we have to impose dependencies across alleles. Extra dependencies make a representation more fragile to errors. In addition, retrieving genotypes across multiple lines in the mono-allelic model and traversing alleles in the multi-allelic model are arguably of the same bioinformatic complexity. Finally, most variant callers effectively use the multi-allelic model as they have to evaluate het likelihoods for all allele combinations.

pgrosu commented 9 years ago

@lh3, maybe I'm looking at it naively, but wouldn't it sort of come to some form like this?

P(D | A1 ^ A2) = P(D ^ A1 ^ A2 ) / P( A1 ^ A2 ), where

P(A1 ^ A2) = P(A1 ^ A2 | D)P(D) + P(A1 ^ A2 | ~D)P(~D)

And for "0 copies of the allele" wouldn't it be something of the form 1 - the rest? Possibly with standardized datasets we could have genotype models of what was in previous datasets, and impute anything we're missing.

richarddurbin commented 9 years ago

Erik Garrison is going to spend some time looking at implementing this with me.

I don't really like adding the reference back to the count information for each allele - that gives the reference a privileged position that I think is a problem.

Also, I think that it is wrong to single-linkage cluster alleles into sites and calculate hets etc. for all. What matters is the pairwise relation between alleles as to whether they are mutually exclusive. This pairwise relationship is not transitive. Consider 10bp deletion A, a SNP at one end B, and a SNP at the other end C. A is exclusive with B and with C, but B and C are compatible. Possible diploid genotypes include ABC as well as AB, AC etc. This is not managed by VCF.
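The non-transitive exclusivity above can be made concrete with a toy check, under the simplifying assumption that two alleles are mutually exclusive iff their reference intervals overlap. The coordinates here are made up for the example.

```python
# Toy illustration of non-transitive mutual exclusivity: a 10 bp deletion A
# and SNPs B, C at its two ends. Coordinates are invented for this example.
def overlaps(x, y):
    """Half-open [start, end) intervals on the reference."""
    return x[0] < y[1] and y[0] < x[1]

A = (100, 110)  # 10 bp deletion
B = (100, 101)  # SNP at the left end of the deletion
C = (109, 110)  # SNP at the right end of the deletion

assert overlaps(A, B) and overlaps(A, C)  # A is exclusive with B and with C
assert not overlaps(B, C)                 # but B and C are compatible
```

Because exclusivity is pairwise and not transitive, clustering overlapping alleles into one "site" (as VCF effectively does) loses valid combinations like a haplotype carrying both B and C opposite a haplotype carrying A.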

Anyway, we will aim to come up with a document and implementation that people can look at.

Richard


lh3 commented 9 years ago

When you separate the reference allele out, you will sometimes need multiple overlapping reference alleles of different lengths, although there is really only one reference allele. This seems inconsistent. The reference allele is special anyway, in that the coordinate system is determined by the reference.

I agree that the pairwise relationship is not transitive. This is similar to the argument that merging in the mono-allelic model is and should be bounded, which I like, too. Nonetheless, variant calling is different. Modern callers do not look for alleles and then cluster them. They start by enumerating haplotypes and use them to evaluate the data. These haplotypes may harbor several atomic SNPs/indels when those variants are observed on the same reads.

richarddurbin commented 9 years ago

I hear what you say about the reference being special, but I think we should be wary of that. There was an interesting talk at ASHG by Ryan Neff about an attempt to make a population-specific reference. They took the majority allele at all loci in ASW and made a new reference based on this, then showed that the result significantly improved both mapping and variant calling in African American samples, compared to the current reference. They drew the conclusion that population-specific references are better, but I am pretty sure that their reference would also have been essentially equally better for samples from other populations - it just reflected the majority sequence at more sites, I suppose helping to decouple nearby rare variants from each other. This is of course an argument for a variation reference, which should be better than even a majority reference. But it does suggest that if people want a simple linear reference for mapping, then tinkering around the edges with decoys may be less important than switching to a majority reference (or perhaps, as more well-founded, an ancestral one), which would give a bigger gain. Of course with my GRC hat on, this is not the way we want to go!

With respect to variant calling, we certainly plan to look at that. For single samples at least I am confident that the mono-alleleic route will work. For imputation based calling over multiple samples I concede it is more complex. We are likely to need to change the imputation process. I am not sure how much.

Richard


mlin commented 9 years ago

The discussion over representation of reference alleles reminds me of the original development of gVCF, in large part to distinguish between reference-match and non-calls (insufficient read coverage etc.) without going back to the read alignments. Representing non-calls could further complicate the CNTx consistency requirements @lh3 mentioned. Very cool nonetheless, looking forward to the proposal!

lh3 commented 9 years ago

In gVCF, we have long stretches of REF alleles not overlapping variants. We would just keep it that way in the CNTx approach. It still works. What I worry about is artifactual short REF alleles overlapping variants. This is like splitting every variant VCF line into at least two new records. Anyway, it is a relatively minor issue.

pgrosu commented 9 years ago

Richard/Erik - Would there be any exciting updates regarding the document? I am very curious to read it.

Thanks, Paul

skeenan commented 9 years ago

This has been dormant since October. I am closing this in 2 days.

skeenan commented 9 years ago

Closing. This can be reopened if necessary.