broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.63k stars 579 forks source link

GVCF of amplicon sequencing shows GQ=0 even with high depth reads in bamout #6100

Open bhanugandham opened 4 years ago

bhanugandham commented 4 years ago

User is seeing non-ref blocks with GQ=0 even though bamout shows reads covering that region. Example: chr1 14155446 . C . . END=14155446 GT:DP:GQ:MIN_DP:PL 0/0: 100:0: 100 :0,0,0

Discussed this at the gatk office hrs and seems like this might be a bug.

Please find below the user report: 1) I'm using 4.1.3.0 2) java -Xmx6g -jar ../gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar HaplotypeCaller -R ~/gatk/hg19.unmasked.fa -G StandardAnnotation -mbq 17 --standard-min-confidence-threshold-for-calling 25 --max-reads-per-alignment-start 0 --disable-read-filter NotDuplicateReadFilter --min-dangling-branch-length 5 --allow-non-unique-kmers-in-ref --kmer-size 30 --kmer-size 10 --kmer-size 15 -ERC GVCF -I read1.sampe.bam -O small.i.vcf -L small.intervals --recover-all-dangling-branches --assembly-region-out test.txt --dont-trim-active-regions --min-assembly-region-size 28

3) The bamout does seem to include the region of interest where GT=0 (see image https://us.v-cdn.net/5019796/uploads/editor/ye/ty6e3xa8vqom.png "")

4) this it the relevant region in the VCF: chr1 14155328 . G . . END=14155401 GT:DP:GQ:MIN_DP:PL 0/0:10:30:10:0,30,385 chr1 14155402 . G A, 414.02 . DP=10;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=36000,10 GT:AD:DP:GQ:PL:SB 1/1:0,10,0:10:30:428,30,0,428,30,428:0,0,10,0 chr1 14155403 . T . . END=14155440 GT:DP:GQ:MIN_DP:PL 0/0:10:30:10:0,30,426 chr1 14155441 . G . . END=14155596 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

This is a region if I'm altering the read to include a SNP in the region where there is a GT=0: chr1 14155328 . G . . END=14155401 GT:DP:GQ:MIN_DP:PL 0/0: 100:99:1 00:0,120,1800 chr1 14155402 . G A, 4486.03 . DP=100;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=360000,100 GT:AD:DP:GQ:PL:SB 1/1:0,100,0: 100:99:4500,301,0,4500,301,4500:0,0,100,0 chr1 14155403 . T . . END=14155445 GT:DP:GQ:MIN_DP:PL 0/0: 100:99: 100:0,120,1800 chr1 14155446 . C . . END=14155446 GT:DP:GQ:MIN_DP:PL 0/0: 100:0: 100:0,0,0 chr1 14155447 . T G, 1829.60 . BaseQRankSum=0.000;DP=100;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=360000,100;ReadPosRankSum=0.00GT:AD:DP:GQ:PL:SB 0/1:50,50,0: 100:99:1837,0,1837,1988,1988,3976:50,0,50,0 chr1 14155448 . T . . END=14155596 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

This Issue was generated from your forums

bhanugandham commented 4 years ago

Please find above a screenshot of the

Screen Shot 2019-08-19 at 3 30 39 PM
bhanugandham commented 4 years ago

Please find above a screenshot of the vcf and bamout in the region of interest. As discussed during gatk office hrs @jamesemery will add more details about the reason behind why the GQ=0.

jamesemery commented 4 years ago

Looking at this example I would suspect that the problem is that all of the reads piled up on this site stop at position 14155455. We only see the PL=0,0,0 after position 14155446 in this GVCF. The GQ is determined by the GL calculation, the logic of which lives in ReferenceConfidenceModel.doIndelRefConfCalc(). Looking at that code it looks like it calculates the SNP GL and the Indel GL and chooses the most pessimistic outlook. Unfortunately every read at this site ends at 14155455, thus when we call out to calcNReadsWithNoPlausibleIndelsReads() we hit a snag since all of the reads in the sample have 9 bases left. If we check the documentation for that method we see the following line: Positions <= maxIndelSize from the end of the provided read/ref always return false. This means that every read returns a value of false for indel informativeness since the result is too close to the end, thus we call into getIndelPLs() with a count of 0 informative reads, thus giving us the PL 0,0,0.

This seems like an inherent problem in Amplicon and other similar sequencing technology data.

davidbenjamin commented 4 years ago

I agree with James. When reads don't go far enough past a variant position you really can't be sure that there is no indel. For amplicon data you may choose to lower the --indel-size-to-eliminate-in-ref-model argument from its default of 10, however.

davidbenjamin commented 4 years ago

@bhanugandham I think we can close this issue because it doesn't seem like a bug and because there's a workaround for users who only want SNP (or smaller indel) reference confidence.