Closed xunjieli closed 4 years ago
PL for ref/<*> is interpreted as ref/any alt allele in GATK
Yea, the proximate cause is that GLnexus doesn't equate <*> with a specific alternate allele; when it goes to fill out PL, it notices that it has "zero but no other values, [which is] uninformative anyway" and blanks out the field.
That technical matter aside though, I would raise a few different points:
Hi @mlin. The PLs can be useful for intermediate coverage projects where LD refinement may still be used. I've had success in using Beagle + GATK3.5 VCFs (from GVCF) to improve genotype accuracy over the GATK3.5 calls alone. This is in spite of the fact the PLs must very approximate due to your point 1 above and the heuristics used to calculate a PL when no ALT is seen.
@jaredo rephrasing to make sure I understand correctly -- the use case is to impute a non-reference allele when there are ~no reads supporting it (so long as there are not that many reference-identical reads either) because it's (i) generally frequent, (ii) linked to another nearby & observed allele, or both. Is this correct?
Hi @mlin,
We feel strongly about having a config option to match GATK’s behavior in keeping PLs for ref calls, to support compatibility with downstream tools that require PLs as input. A concrete example is LD-based imputation with Beagle -- imputation with Beagle requires PLs, so if a user is interested in performing that task they currently have no option to use GLnexus, and are instead forced into the GATK GenotypeGVCFs tooling. As mentioned in your discussion with @jaredo above, LD-based imputation can improve genotype calls particularly for low-coverage sequencing.
To respond to the specific points you raised above to @xunjieli :
Regarding PL in a reference band: this is absolutely a heuristic, as you note. DeepVariant doesn’t even provide DP (just MIN_DP) and the PLs are taken from a binomial calculation on reads from only the first position within the block. However, we bin adjacent records by a quantization of GQ so are essentially saying that we’re comfortable with the loss of resolution of information across the window, and as a consequence the PLs pushed into the site are of the first position. (As an aside, in its current implementation, GLnexus appears to drop PLs from these sites even if they are only a single bp long, which seems undesirable).
Regarding probability value uninformativeness: It’s true that in deeply sequenced WGS many reference blocks have nearly zero evidence of non-reference alleles. This is less true at lower coverage sequencing levels though (which are also where the benefits of imputation are maximized).
Regarding file size increase: This is definitely a concern, and I think warrants tucking this functionality behind a flag (set OFF by default) so that users uninterested in applying tools that require PLs at every site are not burdened by the larger data sizes.
@mlin that rephrasing is correct
Rough implementation notes:
PLFieldHelper::combine_format_data
starts off with basically nothing to work with (just a 0 for the 0/0 genotype); because the other two PL values aren't projected in during the NumericFormatFieldHelper::add_record_data
loop; because the allele_mapping
it consults doesn't equate the gVCF symbolic allele with the specific output alternate allele, nor probably should it. Expedient thing may be for PLFieldHelper
to override both NumericFormatFieldHelper::add_record_data
and FormatFieldHelper::get_out_ind_of_value
; the former can set a private flag that a reference band is currently being processed, in which case the latter can lie that the symbolic allele is the desired alternate allele.
The problematic part as usual will be the multiallelic sites with k>2 alleles counting ref -- these have k(k+1)/2 PL entries to fill in from the reference band's 3 PL values. We'd have to make some ~arbitrary choices. Fill in PL[1] for all 0/i genotypes and PL[2] for all i/i genotypes? What to put for i/j genotypes? get_out_ind_of_value
currently has no way to tell its caller to project one input value into multiple output slots...
@mlin Hi Mike, I'd like to chime in here regarding your last paragraph about multiallelic case, because I've done some experiments about specifically that with GLnexus. I think it'd be useful to have a concrete example of the case you mentioned.
Here are three single-sample example gVCFs (with only one record) generated by DeepVariant:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002
chr20 60001 . T <*> 0 . END=60001 GT:GQ:MIN_DP:PL 0/0:0:13:0,10,100
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG003
chr20 60001 . T C,<*> 37.6 PASS . GT:GQ:DP:AD:VAF:PL 1/1:19:6:0,6,0:1,0:40,30,0,990,990,990
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG004
chr20 60001 . T TA,<*> 37.6 PASS . GT:GQ:DP:AD:VAF:PL 0/1:19:6:3,3,0:1,0:30,0,20,990,990,990
And this is the output from GLnexus after merging (with --config DeepVariant) the three gVCFs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002 HG003 HG004
chr20 60001 chr20_60001_T_C;chr20_60001_T_TA T C,TA 40 . AF=0.333333,0.166667;AQ=40,30 GT:DP:AD:GQ:PL:RNC 0/0:13:13,0,0:0:.:.. 1/1:6:0,6,0:19:40,30,0,.,.,.:.. 0/2:6:3,0,3:19:30,.,.,0,.,20:..
As you can see PL fields of HG003 and HG004 are (understandably) only partially filled with the numbers from the caller, and the PL field for HG002 is completely empty (which is the topic of this ticket). Since the <*>
allele in HG002 gVCF record implies any allele other than the reference, I think it'd be reasonable to assume that it represent the rest of the allele set, i.e. <*> = {C, TA}
for the HG002 sample.
Then as you mentioned we'd somehow need to
(1) write the likelihoods for T/C
, T/TA
, given only T/<*>
likelihood, and
(2) the likelihoods for C/C
, C/TA
, TA/TA
, given only <*>/<*>
likelihood.
Given not enough information, we'll need to use some kind of heuristic here, for example distributing the likelihood equally to all, or duplicating values across the pairs (unless we use some prior information). This is definitely a heuristic, but I think having some reasonable numbers will be definitely useful, given many downstream applications that required PL fields as @cmclean, @jaredo, and @xunjieli mentioned. At least logically this would be simple.
In terms of implementation, I haven't looked at the source, but based on your last comment it looks like in the current state of the code there's no way to specify that certain allele is <*>
allele representing multiple alleles as a set. Is that true? And if it is, how hard it would be to add that information?
With k>=2 total alleles (including ref), there are (k-1) ref/alt genotypes, (k-1) altX/altX genotypes, and (k-1)(k-2)/2 altX/altY genotypes. In the absence of differentiation between altX/altX and altX/altY, I'd propose that the likelihood stored in the PL[1] field be equally distributed across the (k-1) ref/alt genotypes, and the likelihood stored in the PL[2] field be equally distributed among the remaining (k-1) + (k-1)(k-2)/2 == k * (k-1) / 2 genotypes that have no reference allele. That seems like a reasonable heuristic.
EDIT: Based on the careful analysis by @tedyun below it seems like duplicating the values from the <*> allele entries is the simplest mathematically justifiable way to propagate info to all alleles.
Thanks! Just some further background on why this isn't the one-liner fix it seems like it should be. PLFieldHelper
inherits from base classes which try to provide a very flexible framework for projecting gVCF format values (including the read depths, strand bias, etc. as well as genotype likelihoods) into the pVCF where the allele set may be different. To further complicate this, the base logic tries to handle combining format values from multiple gVCF records for a sample, which can happen anytime the pVCF site covers >1bp of reference.
In this case the logic is becoming so specialized that I think we probably should just rewrite PLFieldHelper
to ignore the base logic (get_out_ind_of_value
etc.) for reference bands, and hard-code what we want.
I also want to look ahead to meshing the work on this with plans to implement a proposal for how to write the genotype likelihoods for a multiallelic site ("Local Alleles") without PL's quadratic blowup in k. This will shrink and speed everything up.
I'm tied up the next few days but keeping a mental background thread running on this...
I've thought a little bit more about giving (a tiny bit of) theoretical justification to the heuristic for using GLs specified for multiple set of genotypes (e.g. T/<*>
= {T/C
, T/TA
} in the above example).
By just working on the definition of the genotype likelihood, it seems that the GL of a set of genotypes, GL(G_1, G_2, ... G_k) is given by the weighted average of the GLs of the individual genotypes GL(G_i), where the weights are given by the prior probably of the particular genotype. Please see the attached screenshot below for detail (apologies for attaching a picture - I wish I can use LaTeX in GitHub comments..).
This means that, if we're going to use a very simple heuristic, it'll probably be better to duplicate the GL values (of the genotype including a <*>
) in all of the individual genotypes (since we're taking a weighted average of the individual GLs, not a sum). Concretely that means we're setting GL(T/<*>
) = GL(T/C
) = GL(T/TA
) in the above example. Again, this is a very crude heuristic, but at least in this way we can guarantee that the identity above will always hold.
And thank you @mlin for sharing the details about how things are implemented now! Looking forward to hearing back from you about the results from your mental background threads :)
Just a keepalive here -- I'm still pushing on another project but trying to get to this asarp. In the meanwhile, the VCF spec proposal for non-quadratic PL is under active discussion. You all have been extremely insightful on this thread and if you have any spare moments to take a look at that too, I'm sure your perspective could be really helpful. @xunjieli @jaredo @tedyun @cmclean
We noticed that GLnexus omits PL fields for
0/0
calls while GATK's GenotypeGVCFs retains PL fields from the original gVCF.An example: HG002 has no variant at loc 61083. gVCF content:
Merged VCF with GATK 4.0.6.0 GenotypeGVCFs:
Notice that HG002 is 0/0 with PL is
0,300,2999
.Merged VCF with GLnexus:
Notice that HG002 is 0/0 but PL is
.
.Can GLnexus retain the PL fields?
The VCF spec (https://samtools.github.io/hts-specs/VCFv4.3.pdf) seems underspecified in this regard, and PL for
ref/<*>
is interpreted asref/any alt allele
in GATK. If GLnexus can match this behavior, that will greatly facilitate our downstream analysis. I am somewhat lost in the code and not sure how this can be fixed. Or is GLnexus's treatment better?@mlin