broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.72k stars 594 forks source link

GATK 4.1.2.0 HaplotypeCaller reports nonsense GT/AD when --min-base-quality-score parameter changes #6045

Open freeseek opened 5 years ago

freeseek commented 5 years ago

I have managed to generate a minimal bam file that reproduces the issue.

First of all, you have to download the mini input.bam file from this dropbox link: https://www.dropbox.com/sh/xae79hanumpireu/AABKo1l4Y-z5G5YLBqSpylRva?dl=0

Then the following code will reproduce the issue:

wget https://github.com/broadinstitute/picard/releases/download/2.19.0/picard.jar

wget https://github.com/broadinstitute/gatk/releases/download/4.1.2.0/gatk-4.1.2.0.zip
unzip gatk-4.1.2.0.zip

wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
  gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

java -jar picard.jar \
  CreateSequenceDictionary \
  R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  O=GCA_000001405.15_GRCh38_no_alt_analysis_set.dict

(echo "##fileformat=VCFv4.2"; \
echo "##contig=<ID=chrX,length=156040895>"; \
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; \
echo -e "chr1\t97329945\t.\tT\tA\t.\t.\t."; \
echo -e "chr1\t97329967\t.\tC\tT\t.\t.\t.") | bgzip > input.vcf.gz && \
tabix -f input.vcf.gz

for score in 11 12; do
  gatk-4.1.2.0/gatk HaplotypeCaller \
    -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    -I input.bam \
    -O output.$score.vcf.gz \
    --genotyping-mode GENOTYPE_GIVEN_ALLELES \
    --alleles input.vcf.gz \
    -L chr1:97329945-97329967 \
    --min-base-quality-score $score && \
  bcftools query \
    -f "[%CHROM\t%POS\t%REF\t%ALT\t%GT\t%AD\n]" \
    output.$score.vcf.gz \
    -r chr1:97329945-97329967
done

When the parameter --min-base-quality-score 11 is used, the GT/AD output is this:

chr1    97329945    T   A   1/1 0,35
chr1    97329967    C   T   1/1 0,33

When the parameter --min-base-quality-score 12 is used, the GT/AD output is this:

chr1    97329945    T   A   0/1 9,10
chr1    97329967    C   T   0/1 6,11

The first output is the output that makes sense. When --min-base-quality-score 12 is used, it is as if HaplotypeCaller invents some reference allele reads and then uses them to genotype the variant as heterozygous. I have seen issues like this all over the genome.

Interestingly, if I run the same code on my own Ubuntu laptop, I get an error instead:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/hellbender/transformers/ReadTransformer. Method lambda$identity$d67512bf$1(Lorg/broadinstitute/hellbender/utils/read/GATKRead;)Lorg/broadinstitute/hellbender/utils/read/GATKRead; at index 65 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
    at org.broadinstitute.hellbender.transformers.ReadTransformer.identity(ReadTransformer.java:30)
    at org.broadinstitute.hellbender.engine.GATKTool.makePreReadFilterTransformer(GATKTool.java:345)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:276)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
vruano commented 5 years ago

Will look into this, seems quite strange.

About the exception on Ubuntu I think that is due to a "low level" Java issue. Are you using the same JDK version? Currently gatk is compiled using Java 8.... I think the latest revision should do.

If "java -version" return something that is not 1.8.??? the that is the probable cause. At any rate this is not a programming bug in GATK. At worst it would be caused by a binary building issue when that binary was compiled and package. If your SDK is 1.8.??? I would suggest that you try to compile and install it from source code instead.

freeseek commented 5 years ago

@vruano you are right

The error I get on my Ubuntu 19.04 machine is due to the fact that the version I am using is:

openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu219.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu219.04.1, mixed mode, sharing)

If I revert to version:

openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.19.04.2-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)

I can reproduce the bug. The problem has something to do with the two variants being genotyped at the same time. If I try to genotype one variant at a time the issue does not show.

I noticed this as it is completely throwing off a statistic I am computing related to allelic depth.

vruano commented 5 years ago

I can confirm the assembly of the region is different when you change that parameter from 3 (with minBQ 12) to only 2 haplotypes (with minMQ11) assembled...

I guess you may have taken a look into these variances using IGV... which result do you thing matches truth best?

vruano commented 5 years ago

@freeseek

freeseek commented 5 years ago

igv_snapshot I think you can easily judge from the IGV plot. I don't seem much ambiguity. :-)

vruano commented 5 years ago

I see... The problem is that with minBQ 12 the assembler does not reconstruct the haplotype that contains both variants but two haplotypes that contain one of the variants at a time. As a result reads that would overlap both snp have a 1-bp edition distance agains the reference haplotype and any of the two alternatives. In that case they may seem to support the reference although the Lk difference should be small.

@jamesemery is working in a new version of the assembler that should address some of these issue. I could and se whether the is a quick patch fix for it but if lowering the minBQ does the trick without adding much "crap" as a result you could do that for now.

freeseek commented 5 years ago

Thank you @vruano I am not in a hurry to see this fixed soon, but is there a minBQ that would make this class of problems disappear entirely? I would imagine that the same issue could still surface at any minBQ in other places of the genome. Also, ideally I would like to be able to run it with minBQ=20 as my analysis is very sensitive to sequencing errors.

vruano commented 5 years ago

You are right... unfortunately there is no secret pill here... there is zero guarantees tweaking minBQ and that is why we need that new assembler to get rid of all this quirks. ...

In the meantime what could be done is to get the GGA to induce not only each variant alone but also haplotypes with all the possible combos of such variants. The problem of doing this is that if an active region contains many such variants then there are to many such combinations O(2^n).

vruano commented 5 years ago

Fyi with " --allow-non-unique-kmers-in-ref" it seems to work "better". The AD count for the first variant is then something like 3,3? which makes sense since there are actually 3 reads that have the ref allele for the first variant. Also produces the same result with minBQ 20.

You can try to experiment with this argument that tend to increase the number of reconstructed haplotypes.

vruano commented 5 years ago

@freeseek

vruano commented 5 years ago

I think we should wait for improvements from @jamesemery's work on the assembler although I don't know what is the timeline for those.

freeseek commented 5 years ago

Using --allow-non-unique-kmers-in-ref I get, for --min-base-quality-score 11:

chr1    97329945    T   A   1/1 3,33
chr1    97329967    C   T   1/1 0,34

and for --min-base-quality-score 12:

chr1    97329945    T   A   1/1 3,33
chr1    97329967    C   T   1/1 0,34

So it seems about right. I will give it a try. Thank you!

vruano commented 5 years ago

Good, please post on how well it does, here. I'm curious. @freeseek.

freeseek commented 5 years ago

I have another example where with options:

--min-base-quality-score 17
--allow-non-unique-kmers-in-ref

I get

chr12       60339493  A       G       1/1             0,14
chr12       60339499  A       G       1/1             0,15
chr12       60339510  C       T       1/1             0,17

Which is substantially correct.

But with options:

--min-base-quality-score 18
--allow-non-unique-kmers-in-ref

I get

chr12       60339493  A       G       0/1             3,3
chr12       60339499  A       G       0/1             3,4
chr12       60339510  C       T       1/1             0,14

Which again is very very off from the truth.

And this is what the data looks like: image

The problem is also present with Mutect2. Is there any solution or further suggestion?

freeseek commented 5 years ago

Here how to reproduce the last issue.

First of all, you have to download the mini input.bam file from this dropbox link: https://www.dropbox.com/sh/2xni9oh362ovc71/AAAIm7MZ1tLL6xuUUp7tk2g_a?dl=0

Then the following code will reproduce the issue:

wget https://github.com/broadinstitute/picard/releases/download/2.19.0/picard.jar

wget https://github.com/broadinstitute/gatk/releases/download/4.0.12.0/gatk-4.0.12.0.zip
unzip gatk-4.0.12.0.zip

wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
  gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

java -jar picard.jar \
  CreateSequenceDictionary \
  R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  O=GCA_000001405.15_GRCh38_no_alt_analysis_set.dict

(echo "##fileformat=VCFv4.2"; \
echo "##contig=<ID=chr12,length=133275309>"; \
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; \
echo -e "chr12\t60339493\t.\tA\tG\t.\t.\t."; \
echo -e "chr12\t60339499\t.\tA\tG\t.\t.\t."; \
echo -e "chr12\t60339510\t.\tC\tT\t.\t.\t.") | bgzip > input.vcf.gz && \
tabix -f input.vcf.gz

for score in 17 18; do
  gatk-4.0.12.0/gatk Mutect2 \
    -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    -I input.bam \
    --tumor SM \
    -O output.$score.vcf.gz \
    --genotyping-mode GENOTYPE_GIVEN_ALLELES \
    --alleles input.vcf.gz \
    -L chr12:60339493-60339510 \
    --min-base-quality-score $score \
    --bam-output output.$score.bam && \
  bcftools query \
    -f "[%CHROM\t%POS\t%REF\t%ALT\t%GT\t%AD\n]" \
    output.$score.vcf.gz \
    -r chr12:60339493-60339510
done

With the two outputs being:

chr12   60339493    A   G   0/1 0,9
chr12   60339499    A   G   0/1 0,8
chr12   60339510    C   T   0/1 0,10

and:

chr12   60339493    A   G   0/1 6,0
chr12   60339499    A   G   0/1 2,5
chr12   60339510    C   T   0/1 0,9

And these are the two BAMs with the reconstructed haplotypes I obtained: igv_snapshot

However, I am unable to reproduce the issue with GATK 4.1.3.0. This is a bit unfortunate but I am unable to use newer versions of GATK due to a separate AD bug (see https://github.com/broadinstitute/gatk/issues/6096) which forces me to use GATK 4.0.12.0 so it would be great to have a workaround for this.

freeseek commented 5 years ago

Actually, the problem only arises when using the option --genotyping-mode GENOTYPE_GIVEN_ALLELES. As that option was removed after GATK 4.0.12.0, this might be the reason I am unable to reproduce the bug with newer versions of GATK. What exactly does the --genotyping-mode GENOTYPE_GIVEN_ALLELES option do if the --alleles option is included?

droazen commented 5 years ago

@freeseek The --genotyping-mode argument was not removed, and is still present in newer versions of GATK (see: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/StandardCallerArgumentCollection.java)

The --alleles options specifies the alleles that should be genotyped when in GENOTYPE_GIVEN_ALLELES mode.

freeseek commented 5 years ago
gatk-4.1.3.0/gatk Mutect2 --genotyping-mode GENOTYPE_GIVEN_ALLELES

gives the error

***********************************************************************

A USER ERROR has occurred: genotyping-mode is not a recognized option

***********************************************************************

I think I am missing something. How do you use the --genotyping-mode GENOTYPE_GIVEN_ALLELES option with the latest version of GATK Mutect2?

droazen commented 5 years ago

@freeseek Apparently you just need to specify the --alleles argument to run Mutect2 in GENOTYPE_GIVEN_ALLELES mode these days.

freeseek commented 5 years ago

@droazen and @vruano if the new Mutect2 just needs the "--alleles" to genotype variants, then it seems like this bug could be actually fixed in Mutect2 4.1.3.0 (but it is definitely still present in HaplotypeCaller 4.1.3.0). Unfortunately, Mutect2 4.1.3.0 has currently no workaround for https://github.com/broadinstitute/gatk/issues/6096 Any suggestions?

cwhelan commented 5 years ago

@freeseek I did some looking into the example you posted and I think the main problem is that the --min-base-quality-score doesn't really do what I think you are looking for, and causes lots of problems when set to a high enough level such that many bases from each read are below the threshold.

This parameter is used to select portions of the read that can be used in the assembly. For a given value of the parameter, the assembly engine looks for consecutive substrings of the read such that all bases are above the quality threshold. For each such run of bases found, if the run is longer than or equal to the k-mer length, the run is k-merized and each k-mer is added to the graph. That means that the quality of read bases other than those at the location of the SNPs you are looking at can have an impact on the graph building process.

For the example you gave, here's a view of of the reads at this locus in IGV. I've set IGV to show all bases and to shade out reads below BQ 18 (these are the gray bases).

igv_snapshot

If you look at the two left-most SNPs, you can see that there are very few reads that have even a 10-mer that contains both SNPs. I count 3, I think. This is the information that is needed to build a haplotype path in the assembly graph that would contain both of these variant. For a reason I haven't figured out yet, there's actually only one read that makes it into the assembly graph showing this path. Here's a selection from the assembly graph; that path is the leftmost.

assembly_graph

Since there's only one read supporting it, that path gets pruned from the graph before genotyping, so the genotyper is left with no haplotype that contains both of those variants and therefore makes mistakes in assigning read likelihoods.

Although I think there's more to dig in and understand here, my summary is that choosing a min base quality so high ends up chopping these reads up into tiny parts, some of which are smaller than the k-mer length and get discarded, so you end up throwing away a lot of the data and not allowing the genotyper to discover the correct haplotypes.

freeseek commented 5 years ago

@cwhelan thank you, this is a good explanation. You are right in saying that --min-base-quality-score doesn't do what I am looking for. The explanation in the tool is:

--min-base-quality-score,-mbq:Byte
                              Minimum base quality required to consider a base for calling  Default value: 10. 

Notice that it says for calling and in the documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.3.0/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php) it says:

--min-base-quality-score / -mbq
Minimum base quality required to consider a base for calling
Bases with a quality below this threshold will not be used for calling.

Nowhere it is mentioned that such bases would not be used for aligning the reads, which is a very different concept. I hope at least this lengthy discussion will spur an improvement of the documentation.

In my application I would very much like to see the FORMAT/AD counting only the number of fragments where the evidence for a given SNP is above a certain threshold. Given my understanding of --min-base-quality-score I assume that there is no way to do that with GATK then. I think a better explanation of the --min-base-quality-score is warranted though.

In general, if someone is trying to monitor cancer and they have a high coverage BAM for germline data from which they built a highly reliable VCF with genotypes and low coverage BAM from cell-free DNA, how should they use the VCF from high coverage to drive the molecule counting of the low coverage BAM?