broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.7k stars 591 forks source link

Cases where allowNonUniqueKmersInRef allows HaplotypeCaller to rescue sites #2916

Closed droazen closed 4 years ago

droazen commented 7 years ago

@vdauwera commented on Mon Nov 14 2016

This derives from a bug report in GATK 3.5 where we found that using -allowNonUniqueKmersInRef rescued real sites. Several other cases have been reported (see original issue linked below). We'd like to find out whether we can improve the assembly system to pick these things up by default without having to use a special argument, while also not generating a lot of FPs (which it seems this may do).


@chandrans commented on Tue May 10 2016

Bug Report

Affected tool(s)

HaplotypeCaller

Affected version(s)

Latest version 3.5

Description

A user has reported that a SNP that was called with version 3.1 is not not being called in the latest version. For example, version 3.1 produces this call at the site: 5 180041178 . G A,<NON_REF> 1057.77 . BaseQRankSum=-4.298;ClippingRankSum=-1.485;DP=101;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.289;ReadPosRankSum=0.914 GT:AD:DP:GQ:PL:SB 0/1:52,49,0:101:99:1086,0,1397,1245,1541,2786:16,36,16,33 But, version 3.5 produces this call at the site: 5 180041178 . G <NON_REF> . . END=180041178 GT:DP:GQ:MIN_DP:PL 0/0:104:0:104:0,0,168 Looking at the BAM file, there are indeed many high quality reads/bases that support the SNP.

screen shot 2016-05-10 at 2 50 47 pm

If we zoom out:

screen shot 2016-05-10 at 2 51 31 pm

Then, the bamout file looks like this:

screen shot 2016-05-10 at 2 53 42 pm

No active region there.

However, the bamout file from 3.1 looks like this:

screen shot 2016-05-10 at 2 56 08 pm

Notice the position is part of an active region, but all the reads have MQ0! Yet, somehow this is confidently called as a SNP site.

Last thing, I checked the graphs from both versions and both look exactly the same.

Steps to reproduce

Commands and test files below.

Expected behavior

The SNP should be called in version 3.5 and above.

Actual behavior

The SNP is not called.


Original Forum Post


@chandrans commented on Tue May 10 2016

Test files here: /humgen/gsa-scr1/schandra/jhomsy_MissingVariantInFather/homsy-missing-variant/snippet

Command for 3.5: java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /humgen/gsa-hpprojects/GATK/bundle/2.8/b37/human_g1k_v37.fasta -I /humgen/gsa-scr1/schandra/jhomsy_MissingVariantInFather/homsy-missing-variant/snippet/1-03174-02.snippet.bam -L 5:180040878-180041478 -o Sheila.g.vcf -bamout Sheila.g.bam -ERC GVCF -graph Sheila.g.dot

Command for 3.1: java -jar /humgen/gsa-hpprojects/GATK/bin/GenomeAnalysisTK-3.1-1-g07a4bf8/GenomeAnalysisTK.jar -T HaplotypeCaller -R /humgen/gsa-hpprojects/GATK/bundle/2.8/b37/human_g1k_v37.fasta -I /humgen/gsa-scr1/schandra/jhomsy_MissingVariantInFather/homsy-missing-variant/snippet/1-03174-02.snippet.bam -L 5:180040878-180041478 -o Sheila.3.1.g.vcf -bamout Sheila.3.1.g.bam -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -graph Sheila.3.1.g.dot


@chandrans commented on Tue May 10 2016

Oops. I'm removing the public tag because this user said the data is private. I have screenshots that show large regions.


@ldgauthier commented on Tue May 10 2016

I reviewed a lot of these for the Daly Lab and there are a fair number of differences between 3.1 and 3.4 (our latest production version) but in his case, most of his 3.1 calls that were missing looked like artifacts anyway.


@chandrans commented on Tue May 10 2016

Do you think this is an artifact too?


@sooheelee commented on Wed May 11 2016

The PL's of the v3.4 output are interesting at 0,0,168, hence the GQ 0, and indicate equal weight for hom-ref and some het genotype. Would GenotypeGVCFs be able to pick up on this and eventually call the het genotype given evidence for the A alt allele at the site from other samples?

GT:DP:GQ:MIN_DP:PL 0/0:104:0:104:0,0,168

On Tue, May 10, 2016 at 3:50 PM chandrans notifications@github.com wrote:

Do you think this is an artifact too?

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/broadinstitute/gsa-unstable/issues/1360#issuecomment-218270741


@sooheelee commented on Wed May 11 2016

I just peaked into one of the BAMs (Sheila.3.1.bam).

First, let's differentiate MAPQ from MQ and AS. MAPQ is a SAM record feature, in column 5, and the AS is the alignment score assigned by the aligner as a tag in the format AS:i:#. MQ is the tag that notes the mapping quality of the mate and is added by MergeBamAlignment (so far as I know).

Here's an example of MAPQ=0 (second record):

-bash-3.2$ samtools view Sheila.3.1.bam | grep
'K00162:14:H32NYBBXX:2:1107:10125:22061'
K00162:14:H32NYBBXX:2:1107:10125:22061    99    5    180040981    *60*
48M    =    180041078    241
AGTCCCTTACTCCAGCAGGGCCGGTCACGTAACCTGCCGCCAGTGACC
>>=>>>>===>>>>>>>>>>>?<>>??><>>?>?????<??@@?@??@    HC:i:-1036844999
MC:Z:101M    BD:Z:MNMMKHKJJLMHKMNONNKGLMJKLMMILMMJKLKNONKLNNOOJONN
MD:Z:63G6T30    PG:Z:MarkDuplicates.3.7    RG:Z:1
BI:Z:LOMOJHMLMMOKJMPOOPLINMJLNOOMOPOMNONOONLOOORPORPQ    NM:i:2    *MQ:i:60*
*AS:i:91 *   XS:i:19
K00162:14:H32NYBBXX:2:1107:10125:22061    147    5    180041116    *0    *
63M    =    180040938    -241
TCTGCGTGGTGTACACCTTGTCGAAGATGCTTTCAGGGGCCATCCACTTCAGGGGCAGCCGGG
:???<>??>?>>>?>????>><>=>>=>>>>>=>>>>>>>>==>>=>>=>>>>>>>>?>9<>=
HC:i:-369427419    MC:Z:101M
BD:Z:IONKLIMLIILMIILKJJILMLJJHMMNOJBIMNKHHMLLMMJLINJIMNLIINOPPONKIKK
MD:Z:101    PG:Z:MarkDuplicates.3.7    RG:Z:1
BI:Z:LQPNNLMNLKMNLLNLLLKMNLKLLMNNNKDJNMLGGLMMNOJMLNLKONMHINOORPNKIMM
NM:i:0    *MQ:i:60 *   *AS:i:101 *   XS:i:0
-bash-3.2$

In terms of MAPQ being zero, this happens for reads with multiple valid mappings. That doesn't mean it's a bad mapping, and in fact it can be a great mapping as you can see for the example read above in the AS tag.

REVISED

So MAPQ indicates global mapping and AS measures local mapping score. If one mapping site contains a variant and the other does not, then calling variants for each mapped site is not a good idea. I don't know how supplementary reads are differentiated (MAPQ?--I can look into this), since the way I learned how to run bwa mem asks that all supplementary alignments be treated as secondary alignments (with the -M option). It seems important to confirm whether these supplementary alignments that get flagged secondary (with the `-M) also get MAPQ of 0 or have other nonzero MAPQs. We want our tools, including HaplotypeCaller, to differentiate supplementary alignments and secondary alignments and use supplementary alignments in variant discovery.

Secondary alignments are meant for multimappers (multiple valid mapping locations) and supplementary alignments are meant for chimeric reads (say two records for the same read where one half aligns to the left and the other half aligns to the right of a very large deletion against the reference). This means that we should run bwa mem without the -M option.

Ok, so I'm going to resume thinking HaplotypeCaller filters on MAPQ of 20.


@sooheelee commented on Wed May 11 2016

The implications of this is that (I think) for any workflow that cares about detecting indels in the size range that BWA-MEM would create supplementary alignment records for would then require that we run BWA-MEM without the -M option that we currently recommend. We want both types of mappings.


@vdauwera commented on Wed May 11 2016

For anyone confused about the difference between secondary and supplementary alignments: http://seqanswers.com/forums/showthread.php?t=40239

Currently we actually don't want our variant calling tools to distinguish them -- we prefer to consider them unusable. My understanding is that the size of events that lead to supplementary alignments fall into the scope of structural variation, and any reads that map across them are considered suspect.

So anyway for this user's case it sounds like the difference between versions is that 3.1 was making a call based on data that we (no longer?) consider as receivable evidence. So we should be satisfied with the call made by 3.5.

@ldgauthier would you say that is a fair summary?


@ldgauthier commented on Thu May 12 2016

In 3.1 we were still displaying uninformative reads as being MQ0 in the bamout -- that's a red herring. There is probably one more event in those haplotypes that the MQ0 reads don't span. (I'm assuming blues are haplotypes, but it's hard to tell from this scale.) Note that the annotation for the 3.1 call says MQ=60.

I'm very suspicious of the large deletion that gets introduced after the SNP in question in the 3.1 bamout. I suspect that might be related to the GQ0 ref call in 3.5. Can I get some more info on that? (e.g. end position, zoomed out screenshot, etc.)


@chandrans commented on Tue May 31 2016

Okay. I met the user at last week's MPG and I told him I would put in a good word for him :) He came back today confirming that the SNP is real by Sanger sequencing.

As for your questions Laura, the deletion present in the artificial haplotypes and some of the reads does not get called in 3.1 or in 3.5.

Here is a zoomed out screenshot of the bamout file from 3.1.

screen shot 2016-05-31 at 4 56 20 pm

The first green SNP on the left is the one in question.


@chandrans commented on Tue Jun 14 2016

Figured out at Support meeting that the variant SNP is called when you include -allowNonUniqueKmersInRef in the command.

It seems the kmer including the SNP is quite common the region.

I am going to tell the user about using the flag. However, I think David will take a look into the code to see what exactly is going on and whether it is a good idea to recommend using the flag in repeat regions.


@ldgauthier commented on Wed Jun 15 2016

Valentin has found that that arg is able to recover a lot of our missed indels in the pseudo-diploid truth data, so it's worth investigating. However, I believe when I tried it for MuTect2 against the LUAD data I introduced a not insignificant number of additional variants, likely false positives.

On Tue, Jun 14, 2016 at 4:06 PM, chandrans notifications@github.com wrote:

Figured out at Support meeting that the variant SNP is called when you include -allowNonUniqueKmersInRef in the command.

It seems the kmer including the SNP is quite common the region.

I am going to tell the user about using the flag. However, I think David will take a look into the code to see what exactly is going on and whether it is a good idea to recommend using the flag in repeat regions.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gsa-unstable/issues/1360#issuecomment-225999879, or mute the thread https://github.com/notifications/unsubscribe/AGRhdLY6oKBXcfkv1qDM2idIzsjnU4taks5qLwm1gaJpZM4IbaRr .


@chandrans commented on Mon Aug 29 2016

Two more cases where --allowNonUniqueKmersInRef allows a missed call to be made:

1) https://github.com/broadinstitute/dsde-docs/issues/1211

2) https://github.com/broadinstitute/dsde-docs/issues/1167


@vdauwera commented on Mon Nov 14 2016

Moving this to GATK4 for further investigation of how to deal with these cases where -allowNonUniqueKmersInRef rescues real sites.


@chandrans commented on Wed Nov 16 2016

@SHuang-Broad Are you on this?

davidbenjamin commented 4 years ago

@jamesemery is taking care of this.