broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.7k stars 591 forks source link

Handling low bases in AD calculation #9014

Open pogodina-nadezda opened 2 weeks ago

pogodina-nadezda commented 2 weeks ago

Hi,

I found that even after adding --min-base-quality-score 20 to the HaplotypeCaller command line, low-quality bases are still being counted in the AD field in GVCF.

gatk version: 4.6.0.0 (GATK4 docker image) command: gatk HaplotypeCaller -R hg38.chr17.fna -I chr17.bqsr.hg38.bam -O chr17.g.vcf.gz --dbsnp dbSNP.hg38.vcf.gz -ploidy 2 --max-alternate-alleles 2 --dont-use-soft-clipped-bases --min-base-quality-score 20 --base-quality-score-threshold 20 --minimum-mapping-quality 10 -ERC GVCF

Bases quality in one position (samtools mpileup result): BQSR bam file (input): chr17 3648932 N 47 GaGGcGgcGGcGcGGGcGcccaaaaGGGGGcaGGcGcgcGtcccacg ?5???!'??9+?+999!9!!+55559?999+!?9!!!E+9!!!!5!S BAMOUT: chr17 3648932 N 50 GGGGGGGGGGGGGGGGGGGGcctcaacccccaacggcacccaccagaccg !!?!!!?!!?!?!??5!!!?!!!+55!!!+!55?''+5!?!!++5BBB!I

Line in GVCF: chr17 3648932 rs1555561049 G A,C, 90.64 . BaseQRankSum=-0.419;DB;DP=53;ExcessHet=0.0000;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQRankSum=0.000;RAW_MQandDP=190800,53;ReadPosRankSum=0.352 GT:AD:DP:GQ:PL:SB 0/2:23,7,16,0:46:24:98,24,381,0,241,336,161,402,378,539:20,3,0,23

I expected the DP to be 28 or 20 according to the bamout, and the AD for Cytosine to not exceed 3. Is it expected behaviour that all bases are counted in the AD, regardless of quality?

lbergelson commented 1 week ago

@pogodina-nadezda This is expected behavior but it's unintuitive and not necessarily ideal.
The AD is calculated based on the best match of each read to each possible haplotype during assembly. It's not based on the pileup at the site. So a read that has no quality at a site might still be counted towards that site if it matches the relavent assembled haplotype better than any alternative one during the read scoring phase.

@jamesemery This is the second time in 2 days that we've had questions about this though so it's clearly confusing and maybe not the best solution.