Missing PL fields for 0/0 calls

dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects

Apache License 2.0

145 stars 37 forks source link

Missing PL fields for 0/0 calls #173

Closed xunjieli closed 4 years ago

xunjieli commented 5 years ago

We noticed that GLnexus omits PL fields for 0/0 calls while GATK's GenotypeGVCFs retains PL fields from the original gVCF.

An example: HG002 has no variant at loc 61083. gVCF content:

chr20   61001   .   C   <*> 0   .   END=61130   GT:GQ:MIN_DP:PL 0/0:50:94:0,300,2999

Merged VCF with GATK 4.0.6.0 GenotypeGVCFs:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002   HG003   HG004
chr20   61083   .   C   T   31.18   .   AC=2;AF=0.333;AN=6;DP=94;ExcessHet=3.9794;MLEAC=2;MLEAF=0.333;QD=0.3    GT:AD:DP:GQ:PL:VAF  0/0:94,0:94:99:0,300,2999:. 0/1:29,23:.:27:27,0,49:0.442,0  0/1:35,17:.:36:36,0,59:0.321,0

Notice that HG002 is 0/0 with PL is 0,300,2999.

Merged VCF with GLnexus:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002   HG003   HG004
chr20   61083   chr20_61083_C_T C   T   36  .   AF=0.333333;AQ=36   GT:DP:AD:GQ:PL:RNC  0/0:94:94,0:50:.:.. 0/1:52:29,23:27:27,0,49:..  0/1:53:35,17:37:36,0,59:..

Notice that HG002 is 0/0 but PL is ..

Can GLnexus retain the PL fields?

The VCF spec (https://samtools.github.io/hts-specs/VCFv4.3.pdf) seems underspecified in this regard, and PL for ref/<*> is interpreted as ref/any alt allele in GATK. If GLnexus can match this behavior, that will greatly facilitate our downstream analysis. I am somewhat lost in the code and not sure how this can be fixed. Or is GLnexus's treatment better?

@mlin

mlin commented 5 years ago

PL for ref/<*> is interpreted as ref/any alt allele in GATK

Yea, the proximate cause is that GLnexus doesn't equate <*> with a specific alternate allele; when it goes to fill out PL, it notices that it has "zero but no other values, [which is] uninformative anyway" and blanks out the field.

That technical matter aside though, I would raise a few different points:

How is one supposed to interpret PL in the GVCF reference band covering a lengthy region (130bp in this example)? Over such a large region, what useful information do they really convey beyond DP and MIN_DP? I don't know and that gives me pause before propagating those values into a SNV site in the project VCF where it might appear much more specifically informative about that position than it actually was.
The probability values involved are usually of such magnitude as to make it pointless to do any quantitative calculations from them. In the above example 0/0 is 10^30 times more likely than the the second-place genotype. That's on top of an ascertainment bias from looking at a reference band in the first place: if there were remotely plausible evidence for a non-reference genotype, we should expect to see a VCF record identifying a specific POS+ALT even if it caller GT=0/0.
But they would inflate the output file size, possibly by a lot, considering they would be added to almost all of the genotype matrix entries, and go quadratic in multiallelic sites.

jaredo commented 5 years ago

Hi @mlin. The PLs can be useful for intermediate coverage projects where LD refinement may still be used. I've had success in using Beagle + GATK3.5 VCFs (from GVCF) to improve genotype accuracy over the GATK3.5 calls alone. This is in spite of the fact the PLs must very approximate due to your point 1 above and the heuristics used to calculate a PL when no ALT is seen.

mlin commented 5 years ago

@jaredo rephrasing to make sure I understand correctly -- the use case is to impute a non-reference allele when there are ~no reads supporting it (so long as there are not that many reference-identical reads either) because it's (i) generally frequent, (ii) linked to another nearby & observed allele, or both. Is this correct?

cmclean commented 5 years ago

Hi @mlin,

We feel strongly about having a config option to match GATK’s behavior in keeping PLs for ref calls, to support compatibility with downstream tools that require PLs as input. A concrete example is LD-based imputation with Beagle -- imputation with Beagle requires PLs, so if a user is interested in performing that task they currently have no option to use GLnexus, and are instead forced into the GATK GenotypeGVCFs tooling. As mentioned in your discussion with @jaredo above, LD-based imputation can improve genotype calls particularly for low-coverage sequencing.

To respond to the specific points you raised above to @xunjieli :

Regarding PL in a reference band: this is absolutely a heuristic, as you note. DeepVariant doesn’t even provide DP (just MIN_DP) and the PLs are taken from a binomial calculation on reads from only the first position within the block. However, we bin adjacent records by a quantization of GQ so are essentially saying that we’re comfortable with the loss of resolution of information across the window, and as a consequence the PLs pushed into the site are of the first position. (As an aside, in its current implementation, GLnexus appears to drop PLs from these sites even if they are only a single bp long, which seems undesirable).
Regarding probability value uninformativeness: It’s true that in deeply sequenced WGS many reference blocks have nearly zero evidence of non-reference alleles. This is less true at lower coverage sequencing levels though (which are also where the benefits of imputation are maximized).
Regarding file size increase: This is definitely a concern, and I think warrants tucking this functionality behind a flag (set OFF by default) so that users uninterested in applying tools that require PLs at every site are not burdened by the larger data sizes.

jaredo commented 5 years ago

@mlin that rephrasing is correct

mlin commented 5 years ago

Rough implementation notes:

PLFieldHelper::combine_format_data starts off with basically nothing to work with (just a 0 for the 0/0 genotype); because the other two PL values aren't projected in during the NumericFormatFieldHelper::add_record_data loop; because the allele_mapping it consults doesn't equate the gVCF symbolic allele with the specific output alternate allele, nor probably should it. Expedient thing may be for PLFieldHelper to override both NumericFormatFieldHelper::add_record_data and FormatFieldHelper::get_out_ind_of_value; the former can set a private flag that a reference band is currently being processed, in which case the latter can lie that the symbolic allele is the desired alternate allele.

The problematic part as usual will be the multiallelic sites with k>2 alleles counting ref -- these have k(k+1)/2 PL entries to fill in from the reference band's 3 PL values. We'd have to make some ~arbitrary choices. Fill in PL[1] for all 0/i genotypes and PL[2] for all i/i genotypes? What to put for i/j genotypes? get_out_ind_of_value currently has no way to tell its caller to project one input value into multiple output slots...

tedyun commented 5 years ago

@mlin Hi Mike, I'd like to chime in here regarding your last paragraph about multiallelic case, because I've done some experiments about specifically that with GLnexus. I think it'd be useful to have a concrete example of the case you mentioned.

Here are three single-sample example gVCFs (with only one record) generated by DeepVariant:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG002
chr20   60001   .   T   <*> 0   .   END=60001   GT:GQ:MIN_DP:PL 0/0:0:13:0,10,100

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG003
chr20   60001   .   T   C,<*>   37.6    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:19:6:0,6,0:1,0:40,30,0,990,990,990

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG004
chr20   60001   .   T   TA,<*>  37.6    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:19:6:3,3,0:1,0:30,0,20,990,990,990

And this is the output from GLnexus after merging (with --config DeepVariant) the three gVCFs:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG002   HG003   HG004
chr20   60001   chr20_60001_T_C;chr20_60001_T_TA    T   C,TA    40  .   AF=0.333333,0.166667;AQ=40,30   GT:DP:AD:GQ:PL:RNC  0/0:13:13,0,0:0:.:..    1/1:6:0,6,0:19:40,30,0,.,.,.:.. 0/2:6:3,0,3:19:30,.,.,0,.,20:..

As you can see PL fields of HG003 and HG004 are (understandably) only partially filled with the numbers from the caller, and the PL field for HG002 is completely empty (which is the topic of this ticket). Since the <*> allele in HG002 gVCF record implies any allele other than the reference, I think it'd be reasonable to assume that it represent the rest of the allele set, i.e. <*> = {C, TA} for the HG002 sample.

Then as you mentioned we'd somehow need to (1) write the likelihoods for T/C, T/TA, given only T/<*> likelihood, and (2) the likelihoods for C/C, C/TA, TA/TA, given only <*>/<*> likelihood.

Given not enough information, we'll need to use some kind of heuristic here, for example distributing the likelihood equally to all, or duplicating values across the pairs (unless we use some prior information). This is definitely a heuristic, but I think having some reasonable numbers will be definitely useful, given many downstream applications that required PL fields as @cmclean, @jaredo, and @xunjieli mentioned. At least logically this would be simple.

In terms of implementation, I haven't looked at the source, but based on your last comment it looks like in the current state of the code there's no way to specify that certain allele is <*> allele representing multiple alleles as a set. Is that true? And if it is, how hard it would be to add that information?

cmclean commented 5 years ago

With k>=2 total alleles (including ref), there are (k-1) ref/alt genotypes, (k-1) altX/altX genotypes, and (k-1)(k-2)/2 altX/altY genotypes. In the absence of differentiation between altX/altX and altX/altY, I'd propose that the likelihood stored in the PL[1] field be equally distributed across the (k-1) ref/alt genotypes, and the likelihood stored in the PL[2] field be equally distributed among the remaining (k-1) + (k-1)(k-2)/2 == k * (k-1) / 2 genotypes that have no reference allele. That seems like a reasonable heuristic.

EDIT: Based on the careful analysis by @tedyun below it seems like duplicating the values from the <*> allele entries is the simplest mathematically justifiable way to propagate info to all alleles.

mlin commented 5 years ago

Thanks! Just some further background on why this isn't the one-liner fix it seems like it should be. PLFieldHelper inherits from base classes which try to provide a very flexible framework for projecting gVCF format values (including the read depths, strand bias, etc. as well as genotype likelihoods) into the pVCF where the allele set may be different. To further complicate this, the base logic tries to handle combining format values from multiple gVCF records for a sample, which can happen anytime the pVCF site covers >1bp of reference.

In this case the logic is becoming so specialized that I think we probably should just rewrite PLFieldHelper to ignore the base logic (get_out_ind_of_value etc.) for reference bands, and hard-code what we want.

I also want to look ahead to meshing the work on this with plans to implement a proposal for how to write the genotype likelihoods for a multiallelic site ("Local Alleles") without PL's quadratic blowup in k. This will shrink and speed everything up.

I'm tied up the next few days but keeping a mental background thread running on this...

tedyun commented 5 years ago

I've thought a little bit more about giving (a tiny bit of) theoretical justification to the heuristic for using GLs specified for multiple set of genotypes (e.g. T/<*> = {T/C, T/TA} in the above example).

By just working on the definition of the genotype likelihood, it seems that the GL of a set of genotypes, GL(G_1, G_2, ... G_k) is given by the weighted average of the GLs of the individual genotypes GL(G_i), where the weights are given by the prior probably of the particular genotype. Please see the attached screenshot below for detail (apologies for attaching a picture - I wish I can use LaTeX in GitHub comments..).

This means that, if we're going to use a very simple heuristic, it'll probably be better to duplicate the GL values (of the genotype including a <*>) in all of the individual genotypes (since we're taking a weighted average of the individual GLs, not a sum). Concretely that means we're setting GL(T/<*>) = GL(T/C) = GL(T/TA) in the above example. Again, this is a very crude heuristic, but at least in this way we can guarantee that the identity above will always hold.

tedyun commented 5 years ago

And thank you @mlin for sharing the details about how things are implemented now! Looking forward to hearing back from you about the results from your mental background threads :)

mlin commented 5 years ago

Just a keepalive here -- I'm still pushing on another project but trying to get to this asarp. In the meanwhile, the VCF spec proposal for non-quadratic PL is under active discussion. You all have been extremely insightful on this thread and if you have any spare moments to take a look at that too, I'm sure your perspective could be really helpful. @xunjieli @jaredo @tedyun @cmclean