dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0
142 stars 37 forks source link

allele fequency calculation #247

Open lahage opened 3 years ago

lahage commented 3 years ago

I am using Deepvariant to obtain vcf files for single individuals and merge these files with GLnexus. I am having 2 issues.

1) I observed that a 0/0 is coded if an individual is homozygous reference at a certain position but also if no data was available for the individual at this position. So missing data gets treated as homozygous reference (while it should be ./.).

2) Regarding the allele frequency calculation it appears that both ./. and 0/0 are treated as homozygous reference resulting in allele frequencies which are off (too low with regard to the alt allele).

Could you tell me how to fix this? Thank you very much Laura

mlin commented 3 years ago

Hi!

I observed that a 0/0 is coded if an individual is homozygous reference at a certain position but also if no data was available for the individual at this position. So missing data gets treated as homozygous reference (while it should be ./.).

This shouldn't be happening in general. Perhaps you're hitting some corner case. Can you elaborate on it?

Regarding the allele frequency calculation it appears that both ./. and 0/0 are treated as homozygous reference resulting in allele frequencies which are off (too low with regard to the alt allele).

You can run the output through bcftools to have it add AF=AC/AN instead of AC/2N. The two estimators are good for different things. There were some famous retractions recently stemming from not counting the missing data. (But the latter has its own problems for sure, esp. combining cohorts with different sample prep)

lahage commented 3 years ago

Hi Mike, thank you very much for your answers! I have the following example. It is a extreme case but might help to illustrate the problem.

The data for 8 individuals at a certain position is as follows: Ind1: 1 read aligned (alt) Ind2: 3 reads aligned (alt) Ind3: no reads aligned Ind4: 2 reads aligned (alt) Ind5: 1 read aligned (alt) Ind6: 4 reads aligned (alt) Ind7: 1 read aligned (alt) Ind8: 6 reads alt

The GLneus GT gives me:

Ind1 ./. Ind2 1/1 Ind3 0/0 Ind4 1/1 Ind5 ./. Ind6 1/1 Ind7 ./. Ind8 1/1

and the AF (AC/2N) is 0.5 and AF (AC/AN) is 0.8

It all makes sense to me with the exception of why Ind3 is coded as 0/0 since no reads are aligned for the individual at this position so it should be ./. and thus the AF (AC/AN) should be 1.

I hope this elaboration helps? Cheers Laura

mlin commented 3 years ago

Can you show an excerpt the gVCF file for Ind3 with the relevant and nearby records for the locus? GLnexus should not be calling it 0/0 if there are no data, however, it depends entirely on what the gVCF is telling it.

lahage commented 3 years ago

reads

lahage commented 3 years ago

gvcf

lahage commented 3 years ago

GLnexus vcf

lahage commented 3 years ago

Hi Mike, so I am talking about position 27 023 823. The first picture is the aligned reads at this position, the second is the g.vcf output for Ind3 from deepvariant and the third is the merged vcf output from GLnexus. Cheers, Laura

mlin commented 3 years ago

It looks to me like the source of the 0/0 call at that position 27 023 823 is the last row of the gVCF excerpt you pasted (27023400 END=27024100). GLnexus only changes the gVCF GT in certain limited circumstances, so it's just copying the 0/0 over here even though the record does also say MIN_DP=0.

(I would have preferred to see the gVCF leave that record non-called or omit it entirely. In either of those cases GLnexus would not change it to 0/0 or make up 0/0.)

It's possible to write a custom configuration for GLnexus to set genotyper_config.required_dp > 0, which would make it override the gVCF GT to non-called if fewer reads are reported. However it doesn't do that in the default configuration, and it might have unintended consequences elsewhere.

Better (and advisable in general) would be for anything reading that GT to interpret it in context of the QC fields that are also present and indicate that there's not much to go on there.

Bin-Guan commented 3 years ago

I'm also curious on how Glnexus determines AF from gvcf files. Based on my vcf file output, the formula does not look like to be AF = AC/2N. For example, the count from bcftools is AN=208;AC=8;AC_Het=0;AC_Hom=8;AC_Hemi=0;AF=0.0384615, but Glnexus AF=0.067669. The variant is present in 4 individuals (all homozygous). There are 100 individuals with 0/0, and 29 individuals with ./. for this position. If using AF = AC/2N, then AF = 8/(2*133)=0.030075. The call quality on this position is not ideal, but I think this illustrates the formula question well. Also, does Glnexus output AC? It seems to be missing from my run using deepvariantWGS default config. Thanks.

mlin commented 3 years ago

@Bin-Guan we probably want to double-click on the ./. positions. Can you paste a few representative examples of those entries in the project VCF? (with all the format fields)

Sometimes GLnexus knows more information about the genotypes than it's able to "render" in the output due to VCF representation issues. That information still goes into its AF estimate, this is the likely source of the discrepancy.

GLnexus does not output AC, it would be nice to add that but you can always just run the output through bcftools to get it. The reason for the seemingly basic omissions is how we structure the distributed workflow discussed in the manuscript (not included in this open-source repo), with the way it's sharded no one compute node actually has all of the final output genotypes.

carsonhh commented 10 months ago

I also see the 0/0 instead of ./. issue. It's caused by required_dp: 0 in the config settings (i.e for DeepVariant). I created my own yaml files rather than using the build in preset defaults and set required_dp: 1. It fixes the issue. Second both INFO/AC and INFO/AN are defined in the header by GLnexus despite not being produced anywhere in the output. So technically defining them in the header is a bug. Note this is still true for the latest GLnexus version as of Oct 2023.