broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

Minimum mapping quality behaves strangely with overlapping read pairs #7124

Open nh13 opened 3 years ago

nh13 commented 3 years ago

I was investigating an issue where the depth is reported lower than expected at a given site. The default value of --minimum-mapping-quality is 20, so I tried 1, 20, and 60. Both values 1 and 60 give higher depth (INFO.DP) than 20, which is very counter-intuitive. The read pairs are overlapping, so I tried the --do-not-correct-overlapping-quality option, which caused this bias to go away. I'd still don't understand why increasing and decreasing the minimum mapping quality makes a difference, but it is likely to do with overlapping read pairs.

$ gatk HaplotypeCaller \
  -I in.bam \
  -L chr7:145945238-145945238 \
  -stand-call-conf 0 \
  --disable-optimizations \
  --force-active -O out.vcf \
  --reference /path/to/ucsc.hg19.fasta \
  --minimum-mapping-quality <value>;
$ gatk --version
...
The Genome Analysis Toolkit (GATK) v4.2.0.0
HTSJDK Version: 2.24.0
Picard Version: 2.25.0

(I tried this 4.1.4.0)

--minimum-mapping-quality 1:

chr7    145945238   .   A   G   7534.06 .   AC=2;AF=1.00;AN=2;DP=247;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=58.06;QD=31.52;SOR=1.050   GT:AD:DP:GQ:PL  1/1:0,239:239:99:7548,716,0

--minimum-mapping-quality 20:

chr7    145945238   .   A   G   267.64  .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.838;DP=14;ExcessHet=3.0103;FS=6.264;MLEAC=1;MLEAF=0.500;MQ=59.06;MQRankSum=0.000;QD=22.30;ReadPosRankSum=2.208;SOR=2.022  GT:AD:DP:GQ:PL  0/1:3,9:12:28:275,0,28

--minimum-mapping-quality 60:

chr7    145945238   .   A   G   7150.06 .   AC=2;AF=1.00;AN=2;DP=224;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.06;SOR=1.008   GT:AD:DP:GQ:PL  1/1:0,223:223:99:7164,668,0
test.bam ``` @HD VN:1.6 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @SQ SN:chrM LN:16571 @RG ID:1 SM:Sample LB:Sample PU:na PL:Illumina @PG PN:bwa ID:bwa CL: @PG PN:MarkDuplicates ID:MarkDuplicates CL: q0 99 chr7 145945113 3 151M = 145945249 287 AGAGATATAAGAGGTTGGGGCACGGAAATAAGGGATCGGGGCACAGAGATATAAGAGGCTGGGGCACGGAAATAAGGGATCGGGGCACAGAGATATAAGAGGCTGGGGCACGGAAATAAGGGATCGGGGCGCAGAGATATAAGAGGTCCGG GGAGAGGGIIIIIGIIIIIIGIIIIGGIIIIIIIIIIIIIIIGGIGIGGGGGIIIIIIIIGIIGIIIIIAGIIIIIIIIGGIAGGG.GGGGIIGGIGIIGGG.AGGIAGGGGAAGGGGGGGGGGAGGGIIGGGGGGAGGAGGGGGG<<
droazen commented 3 years ago

@davidbenjamin Can you comment on this interaction between --minimum-mapping-quality and --do-not-correct-overlapping-quality in HaplotypeCaller?

jamesemery commented 3 years ago

@nh13. This is certainly unexpected behavior but there are a few things to unpack about what you have reported. Based on what you say its likely that you have a lot of reads that happen to overlap. When mates overlap each-other we have a base quality adjustment step where we will halve the BQs for the overlapping bases in both reads and if they disagree about any given base we 0 out both of their qualities. The code prior to 4.2.0.0 had a bug where it was very likely to throw away all (or close to all) the bases in the overlaps if either read had any indels or softclips in their cigars at all (since it would be looking at the wrong bases for comparison)(#6886). When reads have long strings of 0 bases they are more likely to be thrown away by the genotyper or possibly end up as uninformative or trigger the read disqualification code. I would recommend running with 4.2.0.0 for this reason. Whats strange is why on your data you are seeing it get better at 60. Is there something particularly unusual about your alignment/the data you have that could be causing issues in the overlap code?

(my own editorialization) Its probably worth re-evaluating this behavior since it often doesn't achieve the desired result of eliminating bias from a single piece of evidence when perhaps the more "correct" solution is to do something at the genotyping stage where the mates are genotyped together as is the case in M2.

The other comment I would make is that in the HaplotypeCaller right now in order to actually use reads with lower mapping qualities you have to adjust two arguments (--mapping-quality-threshold-for-genotyping and --minimum-mapping-quality) otherwise the MQ threshold you are using should only be applied for the active region discovery and not for the genotyping stage.

nh13 commented 3 years ago

@jamesemery you can see from my log above I am using 4.2.0.0, and a test case (hg19 SAM file) is attached above if you're interested in looking. I did play around with the --mapping-quality-threshold-for-genotyping option but it didn't seem to have an effect, but I am not 100% certain what I exactly did.

jamesemery commented 2 years ago

@nh13 I spent a little time looking at this case. It seems like what happened is an assembly failure (that seems to occur since the stars align on the mq 20 limiting assembly graph vs the mq 60 or the mq 1 assembly graph. All of the variation in the graph is in a dangling head which often causes problem and in this case causes the variant to sometimes not be correctly assembled. This is where the dangling head gets separated from garbage in the 20 threshold graph: Screen Shot 2022-01-25 at 4 13 52 PM

And here is that spot in the 60 threshold graph: Screen Shot 2022-01-25 at 4 16 43 PM

And here it is in the 1 threshold

Screen Shot 2022-01-25 at 4 22 17 PM

graph:

This seems to have caused the two thresholds to assemble different haplotypes after dangling end recovery (since all of these are dangling ends because the assembly engine can't do anything else because there is not enough padding provided) and it just so happens this failed assembly misses the correct haplotype in that 20 threshold graph and we end up throwing away most of the reads as incongruent with assembly as a result which is why the depth drops out so low at that site. This is a pretty rare edge case and I happened to be able to recover the 20 mq threshold variant with reasonable correct coverage by playing with the --min-pruning 4 argument.

In general though this issue might or might not have existed if the bam snippet provided (and especially the calling interval you provided of chr7:145945238-145945238) were not centered on one single point since assembly works best and is most likely to succeed when it has a few hundred bases of padding around the variant in question (typically for a SNP we end up with at least 100 bases of active window plus another 300 bases of padding on either side for assembly) which cuts down on the risk of assembly failures like this one. I'm curious if you observed this behavior on the initial variantcalling run of HaplotypeCaller on this data or only when you try scalpel out the SNP at chr7:145945238 for testing in this experiment.