broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

HaplotypeCaller in 4.0.11 in GENOTYPE_GIVEN_ALLELES does not report SNP alleles if overlapping an indel in the data. #5385

Closed sicotte closed 5 years ago

sicotte commented 5 years ago

Haplotype Caller still has some problems with GENOTYPE_GIVEN_ALLELES mode.(using the latest build 4.0.11, donwnloaded today)

Summary: In this case I have a SNP position that I want the genotype reported on. However, It overlaps an indel in the data..

More Details: I have a list of variants I want called on a whole genome .. and I also want all novel variants. 1) Run Haplotype Caller in "discovery" mode. 2) Merge that with my list of wanted variants.. and get this first file. grep -v '#' -c /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.variants.vcf 241 3) I then run the "GENOTYPE_GIVEN_ALLELE" command below The resulting file is missing a lot of variants. grep -v '#' -c /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.vcf 240

gawk -F "\t" 'BEGIN{while(getline<"/research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.vcf"){if(! ($0 ~ /^#./)) {HAVE[$1 ":" $2 ":" $4 ":" $5]=1}}} $0 ~ /^#./{next}HAVE[$1 ":" $2 ":" $4 ":" $5]==1{next}{print $0}' /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.variants.vcf

gives me 6 SNPs.. I then grep back in the input file and found that all but 1 were actually tri-allelic (good job GATK)

So Only 1 was missing. 22 42132027 -1241A>G C T . PASS DBSNP=rs867981442;ID=-1241T>C

I also ran GATK in discovery mode, and the only overlap was

this "wanted" variant 22 42132027 -1241A>G C T . PASS DBSNP=rs867981442;ID=-1241T>C

Overlaps these "discovered" variants. 22 42132025 . T TCC 54.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.405;DP=35;ExcessHet=3.0103;FS=7.228;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.28;ReadPosRankSum=0.491;SOR=2.419 GT:AD:DP:GQ:PL 0/1:16,8:24:92:92,0,663 22 42132026 . T C 593.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.564;DP=37;ExcessHet=3.0103;FS=11.504;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=17.99;ReadPosRankSum=1.516;SOR=0.069 GT:AD:DP:GQ:PL 0/1:17,16:33:99:622,0,665 22 42132027 . CTTTTTTTTTTT C,TTTTTTTTTTTT 1393.73 . AC=1,1;AF=0.500,0.500;AN=2;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=28.08;SOR=1.542GT:AD:DP:GQ:PL 1/2:0,16,17:33:99:1431,682,665,684,0,621

Full COmmand of the GENOTYPE_GIVEN_ALLELES

/research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/tools/gatk-4.0.11.0/gatk --java-options '-Xmx24g -Xms24g' HaplotypeCaller -R /research/bsi/projects/pharmacogenetics/s206438.C YP2D6/processing/RUNS/WGS/GRCh38_nochr.evenM.fa -I /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/GRCh38/NA18565.GRCh38.sorted.bam --genotyping-mode GENOTYPE_GIVEN_AL LELES --output-mode EMIT_ALL_SITES -ERC NONE --standard-min-confidence-threshold-for-calling 0 -O /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA 18565.vcf --alleles /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.variants.vcf -L 22:42123190-42132600
Using GATK jar /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/tools/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar

p.s. This was the "discovery" command

/research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/tools/gatk-4.0.11.0/gatk --java-options '-Xmx24g -Xms24g' HaplotypeCaller -R /research/bsi/projects/pharmacogenetics/s206438.C YP2D6/processing/RUNS/WGS/GRCh38_nochr.evenM.fa -I /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/GRCh38/NA18565.GRCh38.sorted.bam --genotyping-mode DISCOVERY --outpu t-mode EMIT_VARIANTS_ONLY -ERC NONE --standard-min-confidence-threshold-for-calling 10 -O /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.di scovery.vcf -L 22:42123190-42132600
Using GATK jar /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/tools/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx24g -Xms24g -jar /research /bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/tools/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar HaplotypeCaller -R /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/ RUNS/WGS/GRCh38_nochr.evenM.fa -I /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/GRCh38/NA18565.GRCh38.sorted.bam --genotyping-mode DISCOVERY --output-mode EMIT_VARIA NTS_ONLY -ERC NONE --standard-min-confidence-threshold-for-calling 10 -O /research/bsi/projects/pharmacogenetics/s206438.CYP2D6/processing/RUNS/WGS/RUNGRCh38/GATK/tmp/NA18565.discovery.vcf -L 22 :42123190-42132600

Instructions

The github issue tracker is for bug reports, feature requests, and API documentation requests. General questions about how to use the GATK, how to interpret the output, etc. should be asked on the official support forum.


Bug Report

Affected tool(s) or class(es)

Tool/class name(s), special parameters?

Affected version(s)

Description

Describe the problem below. Provide screenshots , stacktrace , logs where appropriate.

Steps to reproduce

Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem. (The support team may follow up to ask you to upload data to reproduce the issue.)

Expected behavior

Tell us what should happen

Actual behavior

Tell us what happens instead


Feature request

Tool(s) or class(es) involved

Tool/class name(s), special parameters?

Description

Specify whether you want a modification of an existing behavior or addition of a new capability. Provide examples, screenshots, where appropriate.


Documentation request

Tool(s) or class(es) involved

Tool/class name(s), parameters?

Description

Describe what needs to be added or modified.


cwhelan commented 5 years ago

Hello @sicotte,

I'm not quite sure I understood all of your steps so please forgive me if I've misinterpreted what you are saying. It sounds like the C->T variant at position 22:42132027 is the one that you believe is missing from the final output? However, it looks like GATK did discover that variant as well, in this line:

22 42132027 . CTTTTTTTTTTT C,TTTTTTTTTTTT 1393.73 . AC=1,1;AF=0.500,0.500;AN=2;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=28.08;SOR=1.542GT:AD:DP:GQ:PL 1/2:0,16,17:33:99:1431,682,665,684,0,621

In this variant the ref allele is CTTTTTTTTTTT and the alts are C and TTTTTTTTTTTT. The first alt is a deletion of 11 Ts after the C at position 42132027. The second alt represents the SNP you are talking about, where the C has become T at position 42132027 but does not have the deletion. Since the SNP and the INDEL are represented on the same line, the ref and alt alleles for the SNP are padded with the poly-Ts that take part in the deletion allele. Were all of these alleles present in the output of GGA mode? In that case I think that HaplotypeCaller is doing the right thing.

This compression of SNP and INDELs at the same position into a single variant record is by design, so that GGA mode can genotype all of the alternate alleles simultaneously, even if they were represented by different records in the input VCF.

If I've misunderstood and this variant did not appear at all in the output of GGA mode, or if you're talking about a different variant, could you please respond with the exact contents of your --alleles file for any variants with alleles that overlap a variant you think is missing from the final output, and the output of GGA mode at those positions? Thanks!

sicotte commented 5 years ago

You understand the issue right.

I didn’t see that the C->T allele was actually represented as a C(11T) -> (12T)

I guess there is a philosophical disagreement on wether the C->T should be reported separately or merged with The other allele at the other position..

Today, I want the C->T separately .. but some other days I might not.

From: cwhelan [mailto:notifications@github.com] Sent: Monday, November 05, 2018 8:24 AM To: broadinstitute/gatk Cc: Sicotte, Hugues, Ph.D.; Mention Subject: [EXTERNAL] Re: [broadinstitute/gatk] HaplotypeCaller in 4.0.11 in GENOTYPE_GIVEN_ALLELES does not report SNP alleles if overlapping an indel in the data. (#5385)

Hello @sicottehttps://github.com/sicotte,

I'm not quite sure I understood all of your steps so please forgive me if I've misinterpreted what you are saying. It sounds like the C->T variant at position 22:42132027 is the one that you believe is missing from the final output? However, it looks like GATK did discover that variant as well, in this line:

22 42132027 . CTTTTTTTTTTT C,TTTTTTTTTTTT 1393.73 . AC=1,1;AF=0.500,0.500;AN=2;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=28.08;SOR=1.542GT:AD:DP:GQ:PL 1/2:0,16,17:33:99:1431,682,665,684,0,621

In this variant the ref allele is CTTTTTTTTTTT and the alts are C and TTTTTTTTTTTT. The first alt is a deletion of 11 Ts after the C at position 42132027. The second alt represents the SNP you are talking about, where the C has become T at position 42132027 but does not have the deletion. Since the SNP and the INDEL are represented on the same line, the ref and alt alleles for the SNP are padded with the poly-Ts that take part in the deletion allele. Were all of these alleles present in the output of GGA mode? In that case I think that HaplotypeCaller is doing the right thing.

This compression of SNP and INDELs at the same position into a single variant record is by design, so that GGA mode can genotype all of the alternate alleles simultaneously, even if they were represented by different records in the input VCF.

If I've misunderstood and this variant did not appear at all in the output of GGA mode, or if you're talking about a different variant, could you please respond with the exact contents of your --alleles file for any variants with alleles that overlap a variant you think is missing from the final output, and the output of GGA mode at those positions? Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/broadinstitute/gatk/issues/5385#issuecomment-435892883, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AB3DCwaj5qj18kW_rYV-DjEkN7YuMzCFks5usEn7gaJpZM4YLbo2.

cwhelan commented 5 years ago

I can definitely appreciate that in some cases downstream analysis might be made easier if the original representations of GGA mode alleles were preserved.

Internally, HaplotypeCaller has to convert all variants at a position to share the same reference context so that read alignments can be compared to all possible alternate alleles simultaneously, and it would be a complex and error-prone process to re-map the unified alleles back to their original representations, and would also pose problems in terms of computing the correct values for INFO field annotations such as DP if the output VCF had to be split across multiple lines according to the how things were specified in the input files.

I'm going to close this for now unless you strongly object or have an additional case that shows an error in GGA mode output. It's possible that in the future we could implement an enhancement in the form of a mode that preserves the GGA input allele representations, possibly under some stricter conditions upon the input, but that would likely be a tricky implementation. Pinging @ldgauthier to make sure she agrees.