broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 594 forks source link

Mutect2 is incompatible with Mutect2 VCFs as feature files #8699

Open lculibrk opened 9 months ago

lculibrk commented 9 months ago

Bug Report

Affected tool(s) or class(es)

Mutect2

Affected version(s)

4.4.0.0

Description

Hello GATK team,

I am working on a pipeline that calls variants using Mutect2 and uses those mutations for somatic variant calling in other samples. To achieve this I'm using the -allele flag in Mutect2 to pass the desired positions to genotype to Mutect2. However it appears that Mutect2 will output variants that it then is unable to parse as an -allele file. I believe this is because the REF column is too long.

Steps to reproduce

Try to variant call using an -allele VCF containing this line:

5   283041  .   AGAAGACTCGGGGAGGAGCTGAGGTTCTAGTTTGAGGGTCGTGCACCTGGAGAACTGGACAGGAGCTGATGTTCTAGATTGAGCATCGTACAGCTGAAGACTTGGGGAGGAGCTTATGTTGTTCACTTTGAGGGTCTTTCAGCTGGAGACTCAGGCAGGAGCTGATGTTCTAGTTTGAGGATCTCGTAGCTGCAGAATCAGAGAGGAGCTGATGTTCTAGATTGAGGATCTTGTAGCTACAGACCCATAGAGGAGCTGATGATCTAGATTCAGGGTCATGCAGCT   A   .   .   AS_SB_TABLE=57,54|8,7;DP=126;ECNT=1;MBQ=30,31;MFRL=358,237;MMQ=60,60;MPOS=15;POPAF=7.30;TLOD=3.22   GT:AD:AF:DP:F1R2:F2R1:FAD:SB    0/1:111,15:0.028:126:17,1:18,5:86,9:57,54,8,7

Expected behavior

Mutect2 should either not emit an invalid variant or it should be able to parse it

Actual behavior

java.lang.ArrayIndexOutOfBoundsException: arraycopy: source index -152 out of bounds for byte[278]
    at java.base/java.lang.System.arraycopy(Native Method)
    at java.base/java.util.Arrays.copyOfRange(Arrays.java:3823)
    at org.broadinstitute.hellbender.tools.walkers.annotator.TandemRepeat.getNumTandemRepeatUnits(TandemRepeat.java:54)
    at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyRegionTrimmer.trim(AssemblyRegionTrimmer.java:189)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:273)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:304)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:200)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
wchukwu commented 1 month ago

Hi. I also encounter this error on gatk v 4.6.0.0. Did you find a workaround or has any progress been made to address this? Thank you!

gokalpcelik commented 1 month ago

Hi all. We do have a PR in the master branch to fix this issue. However keep in mind that the variant you are trying to call is probably way above the size that one can call a simple INDEL. https://github.com/broadinstitute/gatk/pull/6388

ahwanpandey commented 1 month ago

Same error here with gatk v 4.6.0.0. Please help!

lculibrk commented 1 month ago

@ahwanpandey the way I've worked around this in my workflows is to filter very large REF/ALT lengths. However from what I can tell the critical length is inconsistent, somewhere between 100-150 from memory, depending on the variant/data.

ahwanpandey commented 1 month ago

Hi @lculibrk Thank you so much for sharing your workaround! I will give it a go and let you know how I go. Really appreciate it, thanks again.

ahwanpandey commented 1 month ago

@lculibrk worked great! I just filtered for any variants with length >=250bp. Didn't really want to try other thresholds but maybe it only breaks at >300. Anyways, filtering for >=250bp is good enough :)

And if anyone finds it useful, vt has a quick way to filter these long variants:

https://genome.sph.umich.edu/wiki/Vt vt view -f "LEN<250" -h full.VCF.gz > filtered.VCF.gz