Open freeseek opened 5 years ago
Will look into this, seems quite strange.
About the exception on Ubuntu I think that is due to a "low level" Java issue. Are you using the same JDK version? Currently gatk is compiled using Java 8.... I think the latest revision should do.
If "java -version" return something that is not 1.8.??? the that is the probable cause. At any rate this is not a programming bug in GATK. At worst it would be caused by a binary building issue when that binary was compiled and package. If your SDK is 1.8.??? I would suggest that you try to compile and install it from source code instead.
@vruano you are right
The error I get on my Ubuntu 19.04 machine is due to the fact that the version I am using is:
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu219.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu219.04.1, mixed mode, sharing)
If I revert to version:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.19.04.2-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
I can reproduce the bug. The problem has something to do with the two variants being genotyped at the same time. If I try to genotype one variant at a time the issue does not show.
I noticed this as it is completely throwing off a statistic I am computing related to allelic depth.
I can confirm the assembly of the region is different when you change that parameter from 3 (with minBQ 12) to only 2 haplotypes (with minMQ11) assembled...
I guess you may have taken a look into these variances using IGV... which result do you thing matches truth best?
@freeseek
I think you can easily judge from the IGV plot. I don't seem much ambiguity. :-)
I see... The problem is that with minBQ 12 the assembler does not reconstruct the haplotype that contains both variants but two haplotypes that contain one of the variants at a time. As a result reads that would overlap both snp have a 1-bp edition distance agains the reference haplotype and any of the two alternatives. In that case they may seem to support the reference although the Lk difference should be small.
@jamesemery is working in a new version of the assembler that should address some of these issue. I could and se whether the is a quick patch fix for it but if lowering the minBQ does the trick without adding much "crap" as a result you could do that for now.
Thank you @vruano I am not in a hurry to see this fixed soon, but is there a minBQ that would make this class of problems disappear entirely? I would imagine that the same issue could still surface at any minBQ in other places of the genome. Also, ideally I would like to be able to run it with minBQ=20 as my analysis is very sensitive to sequencing errors.
You are right... unfortunately there is no secret pill here... there is zero guarantees tweaking minBQ and that is why we need that new assembler to get rid of all this quirks. ...
In the meantime what could be done is to get the GGA to induce not only each variant alone but also haplotypes with all the possible combos of such variants. The problem of doing this is that if an active region contains many such variants then there are to many such combinations O(2^n).
Fyi with " --allow-non-unique-kmers-in-ref" it seems to work "better". The AD count for the first variant is then something like 3,3? which makes sense since there are actually 3 reads that have the ref allele for the first variant. Also produces the same result with minBQ 20.
You can try to experiment with this argument that tend to increase the number of reconstructed haplotypes.
@freeseek
I think we should wait for improvements from @jamesemery's work on the assembler although I don't know what is the timeline for those.
Using --allow-non-unique-kmers-in-ref
I get, for --min-base-quality-score 11
:
chr1 97329945 T A 1/1 3,33
chr1 97329967 C T 1/1 0,34
and for --min-base-quality-score 12
:
chr1 97329945 T A 1/1 3,33
chr1 97329967 C T 1/1 0,34
So it seems about right. I will give it a try. Thank you!
Good, please post on how well it does, here. I'm curious. @freeseek.
I have another example where with options:
--min-base-quality-score 17
--allow-non-unique-kmers-in-ref
I get
chr12 60339493 A G 1/1 0,14
chr12 60339499 A G 1/1 0,15
chr12 60339510 C T 1/1 0,17
Which is substantially correct.
But with options:
--min-base-quality-score 18
--allow-non-unique-kmers-in-ref
I get
chr12 60339493 A G 0/1 3,3
chr12 60339499 A G 0/1 3,4
chr12 60339510 C T 1/1 0,14
Which again is very very off from the truth.
And this is what the data looks like:
The problem is also present with Mutect2. Is there any solution or further suggestion?
Here how to reproduce the last issue.
First of all, you have to download the mini input.bam file from this dropbox link: https://www.dropbox.com/sh/2xni9oh362ovc71/AAAIm7MZ1tLL6xuUUp7tk2g_a?dl=0
Then the following code will reproduce the issue:
wget https://github.com/broadinstitute/picard/releases/download/2.19.0/picard.jar
wget https://github.com/broadinstitute/gatk/releases/download/4.0.12.0/gatk-4.0.12.0.zip
unzip gatk-4.0.12.0.zip
wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
java -jar picard.jar \
CreateSequenceDictionary \
R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
O=GCA_000001405.15_GRCh38_no_alt_analysis_set.dict
(echo "##fileformat=VCFv4.2"; \
echo "##contig=<ID=chr12,length=133275309>"; \
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; \
echo -e "chr12\t60339493\t.\tA\tG\t.\t.\t."; \
echo -e "chr12\t60339499\t.\tA\tG\t.\t.\t."; \
echo -e "chr12\t60339510\t.\tC\tT\t.\t.\t.") | bgzip > input.vcf.gz && \
tabix -f input.vcf.gz
for score in 17 18; do
gatk-4.0.12.0/gatk Mutect2 \
-R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
-I input.bam \
--tumor SM \
-O output.$score.vcf.gz \
--genotyping-mode GENOTYPE_GIVEN_ALLELES \
--alleles input.vcf.gz \
-L chr12:60339493-60339510 \
--min-base-quality-score $score \
--bam-output output.$score.bam && \
bcftools query \
-f "[%CHROM\t%POS\t%REF\t%ALT\t%GT\t%AD\n]" \
output.$score.vcf.gz \
-r chr12:60339493-60339510
done
With the two outputs being:
chr12 60339493 A G 0/1 0,9
chr12 60339499 A G 0/1 0,8
chr12 60339510 C T 0/1 0,10
and:
chr12 60339493 A G 0/1 6,0
chr12 60339499 A G 0/1 2,5
chr12 60339510 C T 0/1 0,9
And these are the two BAMs with the reconstructed haplotypes I obtained:
However, I am unable to reproduce the issue with GATK 4.1.3.0. This is a bit unfortunate but I am unable to use newer versions of GATK due to a separate AD bug (see https://github.com/broadinstitute/gatk/issues/6096) which forces me to use GATK 4.0.12.0 so it would be great to have a workaround for this.
Actually, the problem only arises when using the option --genotyping-mode GENOTYPE_GIVEN_ALLELES
. As that option was removed after GATK 4.0.12.0, this might be the reason I am unable to reproduce the bug with newer versions of GATK. What exactly does the --genotyping-mode GENOTYPE_GIVEN_ALLELES
option do if the --alleles
option is included?
@freeseek The --genotyping-mode
argument was not removed, and is still present in newer versions of GATK (see: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/StandardCallerArgumentCollection.java)
The --alleles
options specifies the alleles that should be genotyped when in GENOTYPE_GIVEN_ALLELES
mode.
gatk-4.1.3.0/gatk Mutect2 --genotyping-mode GENOTYPE_GIVEN_ALLELES
gives the error
***********************************************************************
A USER ERROR has occurred: genotyping-mode is not a recognized option
***********************************************************************
I think I am missing something. How do you use the --genotyping-mode GENOTYPE_GIVEN_ALLELES
option with the latest version of GATK Mutect2?
@freeseek Apparently you just need to specify the --alleles
argument to run Mutect2 in GENOTYPE_GIVEN_ALLELES mode these days.
@droazen and @vruano if the new Mutect2 just needs the "--alleles" to genotype variants, then it seems like this bug could be actually fixed in Mutect2 4.1.3.0 (but it is definitely still present in HaplotypeCaller 4.1.3.0). Unfortunately, Mutect2 4.1.3.0 has currently no workaround for https://github.com/broadinstitute/gatk/issues/6096 Any suggestions?
@freeseek I did some looking into the example you posted and I think the main problem is that the --min-base-quality-score
doesn't really do what I think you are looking for, and causes lots of problems when set to a high enough level such that many bases from each read are below the threshold.
This parameter is used to select portions of the read that can be used in the assembly. For a given value of the parameter, the assembly engine looks for consecutive substrings of the read such that all bases are above the quality threshold. For each such run of bases found, if the run is longer than or equal to the k-mer length, the run is k-merized and each k-mer is added to the graph. That means that the quality of read bases other than those at the location of the SNPs you are looking at can have an impact on the graph building process.
For the example you gave, here's a view of of the reads at this locus in IGV. I've set IGV to show all bases and to shade out reads below BQ 18 (these are the gray bases).
If you look at the two left-most SNPs, you can see that there are very few reads that have even a 10-mer that contains both SNPs. I count 3, I think. This is the information that is needed to build a haplotype path in the assembly graph that would contain both of these variant. For a reason I haven't figured out yet, there's actually only one read that makes it into the assembly graph showing this path. Here's a selection from the assembly graph; that path is the leftmost.
Since there's only one read supporting it, that path gets pruned from the graph before genotyping, so the genotyper is left with no haplotype that contains both of those variants and therefore makes mistakes in assigning read likelihoods.
Although I think there's more to dig in and understand here, my summary is that choosing a min base quality so high ends up chopping these reads up into tiny parts, some of which are smaller than the k-mer length and get discarded, so you end up throwing away a lot of the data and not allowing the genotyper to discover the correct haplotypes.
@cwhelan thank you, this is a good explanation. You are right in saying that --min-base-quality-score
doesn't do what I am looking for. The explanation in the tool is:
--min-base-quality-score,-mbq:Byte
Minimum base quality required to consider a base for calling Default value: 10.
Notice that it says for calling
and in the documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.3.0/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php) it says:
--min-base-quality-score / -mbq
Minimum base quality required to consider a base for calling
Bases with a quality below this threshold will not be used for calling.
Nowhere it is mentioned that such bases would not be used for aligning the reads, which is a very different concept. I hope at least this lengthy discussion will spur an improvement of the documentation.
In my application I would very much like to see the FORMAT/AD counting only the number of fragments where the evidence for a given SNP is above a certain threshold. Given my understanding of --min-base-quality-score
I assume that there is no way to do that with GATK then. I think a better explanation of the --min-base-quality-score
is warranted though.
In general, if someone is trying to monitor cancer and they have a high coverage BAM for germline data from which they built a highly reliable VCF with genotypes and low coverage BAM from cell-free DNA, how should they use the VCF from high coverage to drive the molecule counting of the low coverage BAM?
I have managed to generate a minimal bam file that reproduces the issue.
First of all, you have to download the mini input.bam file from this dropbox link: https://www.dropbox.com/sh/xae79hanumpireu/AABKo1l4Y-z5G5YLBqSpylRva?dl=0
Then the following code will reproduce the issue:
When the parameter
--min-base-quality-score 11
is used, the GT/AD output is this:When the parameter
--min-base-quality-score 12
is used, the GT/AD output is this:The first output is the output that makes sense. When
--min-base-quality-score 12
is used, it is as if HaplotypeCaller invents some reference allele reads and then uses them to genotype the variant as heterozygous. I have seen issues like this all over the genome.Interestingly, if I run the same code on my own Ubuntu laptop, I get an error instead: