broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 593 forks source link

Possible bug in GenotypeGVCFs 4.6.0.0, affecting all genotyping applications #9007

Open gevro opened 1 month ago

gevro commented 1 month ago

Hi, It seems that for samples in which a variant was NOT detected in a cohort, that GenotypeGVCFs is putting read depth in the AD and DP FORMAT fields of those samples' gVCF MIN_DP fields, rather than AD and DP fields.

Example - after Combine GVCFs, here are two samples, one with (Sample1) and one without (Sample2) the variant detected:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2

chr18 46641978 . C T,A,G, . . BaseQRankSum=-0.507;DP=51737;ExcessHet=0;MQRankSum=0;RAW_MQandDP=184603201,51279;ReadPosRankSum=0.338;AC=0,0,0,0;AN=0 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:516,917,0,0,0:1433:99:.:18707,0,8863,20253,11609,31862,20253,11609,31862,31862,20253,11609,31862,31862,31862:252,264,458,459 ./.:.:198:99:48:0,99,1307,99,1307,1307,99,1307,1307,1307,99,1307,1307,1307,1307:.

--> You can see that DP for Sample1 and Sample2 are 1433 and 198 respectively. And MIN_DP are '.' and 48 respectively. (Note, I'm not sure what MIN_DP = '.' means).

After GenotypeGVCFs, here are the results:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BTM-1900-PTA-1-C11_DNA BTM-1900-PTA-1-D11_DNA

chr18 46641978 . C T 484450 . AC=1;AF=0.436;AN=4;BaseQRankSum=-0.507;DP=51737;ExcessHet=112.96;FS=0;InbreedingCoeff=-0.7722;MLEAC=61;MLEAF=0.436;MQ=60;MQRankSum=0;QD=9.73;ReadPosRankSum=0.338;SOR=0.669 GT:AD:DP:GQ:PL 0/1:516,917:1433:99:18707,0,8863 0/0:48,0:48:99:0,99,1307

--> DP here is 1433 for Sample1 (correct) and 48 for Sample2 (INCORRECT). DP for Sample2 should be equal to 198, and AD values are also wrong. It seems that GenotypeGVCFs is pulling from the MIN_DP field, which doesn't make sense.

This seems like a likely (quite serious) bug, unless I'm not understanding something fundamental about how GenotypeGVCFs works.

gevro commented 1 month ago

Perhaps related to #7185.

It is not clear to me why min or median would be options, rather than GenotypeGVCFs simply getting the actual correct DP from each sample's GVCF.

gokalpcelik commented 1 month ago

Hi @gevro This is not a bug but the way large HOMREF reference confidence intervals are emitted via GVCF mode. Containing each base information individually brings a tremendous burden on the GVCF format as well as space concerns are becoming outstanding for each sample if you are working with hundreds of thousands of samples. In fact we have an even more greedy approach used by ReblockGVCF which removes marginal and most probably problematic alleles from the GVCF schema and compresses reference blocks even further to help us genotype millions of samples with less resources.

If you are interested in getting actual AD and DP values at each base you need to activate -ERC BP_RESOLUTION mode.

I hope this helps.

Regards.

gevro commented 1 month ago

Thanks for explaining, but I don't think the current output makes sense. It is outputting wrong information that is misleading.

For example, in the above example the actual DP of Sample2 is 198, but CombineGVCFs is reporting it as 48. It would be better to report '.' than to report a wrong number.

Irrespective of that, I'm not sure why at the end of CombineGVCFs the tool can't quickly pull the correct DP values from the input gvcfs. Computationally, even for millions of samples, that would be trivial.

gokalpcelik commented 1 month ago

Can you provide us the actual GVCF line for that variant site from Sample2 before combining? It will explain things a little better.

gevro commented 1 month ago

I copied that above in the original post, after the line 'Example - after Combine GVCFs' is the output of CombineGVCFs which has the correct DP values. Then GenotypeGVCFs outputs the wrong DP values.

gokalpcelik commented 1 month ago

I am asking for the original GVCF entry from the uncombined Sample2 GVCF file not the one from the combined file. Those 2 are quite different usually.

Let me explain a bit more detailed way.

Reference confidence intervals are marked with a start and end position and DP value provided for that position belongs to the starting point of the interval. Interval also contains a FORMAT/MIN_DP parameter that indicates the minimum depth for that interval from start to end position. If a variant in another sample falls somewhere in the middle of this reference confidence interval, DP value indicated for the start position does not apply therefore CombineGVCFs tool takes MIN_DP as its reference instead of the DP since that DP is not the real DP at the HOMREF position for Sample2 in your case.

This is how data is compressed and expressed in GVCF format. If you wish to have each position to have its own DP value you need to select BP_RESOLUTION mode for GVCF output.

I hope this explains it.

Regards.