Open gevro opened 1 month ago
Perhaps related to #7185.
It is not clear to me why min or median would be options, rather than GenotypeGVCFs simply getting the actual correct DP from each sample's GVCF.
Hi @gevro This is not a bug but the way large HOMREF reference confidence intervals are emitted via GVCF mode. Containing each base information individually brings a tremendous burden on the GVCF format as well as space concerns are becoming outstanding for each sample if you are working with hundreds of thousands of samples. In fact we have an even more greedy approach used by ReblockGVCF which removes marginal and most probably problematic alleles from the GVCF schema and compresses reference blocks even further to help us genotype millions of samples with less resources.
If you are interested in getting actual AD and DP values at each base you need to activate -ERC BP_RESOLUTION
mode.
I hope this helps.
Regards.
Thanks for explaining, but I don't think the current output makes sense. It is outputting wrong information that is misleading.
For example, in the above example the actual DP of Sample2 is 198, but CombineGVCFs is reporting it as 48. It would be better to report '.' than to report a wrong number.
Irrespective of that, I'm not sure why at the end of CombineGVCFs the tool can't quickly pull the correct DP values from the input gvcfs. Computationally, even for millions of samples, that would be trivial.
Can you provide us the actual GVCF line for that variant site from Sample2 before combining? It will explain things a little better.
I copied that above in the original post, after the line 'Example - after Combine GVCFs' is the output of CombineGVCFs which has the correct DP values. Then GenotypeGVCFs outputs the wrong DP values.
I am asking for the original GVCF entry from the uncombined Sample2 GVCF file not the one from the combined file. Those 2 are quite different usually.
Let me explain a bit more detailed way.
Reference confidence intervals are marked with a start and end position and DP value provided for that position belongs to the starting point of the interval. Interval also contains a FORMAT/MIN_DP parameter that indicates the minimum depth for that interval from start to end position. If a variant in another sample falls somewhere in the middle of this reference confidence interval, DP value indicated for the start position does not apply therefore CombineGVCFs tool takes MIN_DP as its reference instead of the DP since that DP is not the real DP at the HOMREF position for Sample2 in your case.
This is how data is compressed and expressed in GVCF format. If you wish to have each position to have its own DP value you need to select BP_RESOLUTION mode for GVCF output.
I hope this explains it.
Regards.
Hi, It seems that for samples in which a variant was NOT detected in a cohort, that GenotypeGVCFs is putting read depth in the AD and DP FORMAT fields of those samples' gVCF MIN_DP fields, rather than AD and DP fields.
Example - after Combine GVCFs, here are two samples, one with (Sample1) and one without (Sample2) the variant detected:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr18 46641978 . C T,A,G, . . BaseQRankSum=-0.507;DP=51737;ExcessHet=0;MQRankSum=0;RAW_MQandDP=184603201,51279;ReadPosRankSum=0.338;AC=0,0,0,0;AN=0 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:516,917,0,0,0:1433:99:.:18707,0,8863,20253,11609,31862,20253,11609,31862,31862,20253,11609,31862,31862,31862:252,264,458,459 ./.:.:198:99:48:0,99,1307,99,1307,1307,99,1307,1307,1307,99,1307,1307,1307,1307:.
--> You can see that DP for Sample1 and Sample2 are 1433 and 198 respectively. And MIN_DP are '.' and 48 respectively. (Note, I'm not sure what MIN_DP = '.' means).
After GenotypeGVCFs, here are the results:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BTM-1900-PTA-1-C11_DNA BTM-1900-PTA-1-D11_DNA
chr18 46641978 . C T 484450 . AC=1;AF=0.436;AN=4;BaseQRankSum=-0.507;DP=51737;ExcessHet=112.96;FS=0;InbreedingCoeff=-0.7722;MLEAC=61;MLEAF=0.436;MQ=60;MQRankSum=0;QD=9.73;ReadPosRankSum=0.338;SOR=0.669 GT:AD:DP:GQ:PL 0/1:516,917:1433:99:18707,0,8863 0/0:48,0:48:99:0,99,1307
--> DP here is 1433 for Sample1 (correct) and 48 for Sample2 (INCORRECT). DP for Sample2 should be equal to 198, and AD values are also wrong. It seems that GenotypeGVCFs is pulling from the MIN_DP field, which doesn't make sense.
This seems like a likely (quite serious) bug, unless I'm not understanding something fundamental about how GenotypeGVCFs works.