broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Problems running Mutect2 #6695

Open ashwini06 opened 4 years ago

ashwini06 commented 4 years ago

I have problems running gatk Mutect2.

gatk version

command-line

gatk Mutect2 -R /home/proj/stage/cancer/reference/GRCh37/genome/human_g1k_v37_decoy.fasta -L /home/proj/stage/cancer/reference/target_capture_bed/production/balsamic/gicfdna_3.1_hg1

Error

Using GATK jar /home/proj/bin/conda/envs/D_UMI_APJ/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/proj/bin/conda/envs/D_UMI_APJ/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar Mutect2 -R /home/proj/stage/cancer/reference/GRCh37/genome/human_g1k_v37_decoy.fasta -L /home/proj/stage/cancer/reference/target_capture_bed/production/balsamic/gicfdna_3.1_hg19_design.bed -I consensus/concatenated_ACC5611A1_XXXXXX_consensusalign_ss_r2.bam -O mutect2/concatenated_ACC5611A1_XXXXXX_mutect2_unfiltered_ss_r2.vcf.gz
09:39:55.358 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/proj/bin/conda/envs/D_UMI_APJ/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jul 03, 2020 9:39:55 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
09:39:55.559 INFO  Mutect2 - ------------------------------------------------------------
09:39:55.559 INFO  Mutect2 - The Genome Analysis Toolkit (GATK) v4.1.8.0
09:39:55.559 INFO  Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
09:39:55.559 INFO  Mutect2 - Executing as ashwini.jeggari@hasta.scilifelab.se on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
09:39:55.560 INFO  Mutect2 - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
09:39:55.560 INFO  Mutect2 - Start Date/Time: July 3, 2020 9:39:55 AM CEST
09:39:55.560 INFO  Mutect2 - ------------------------------------------------------------
09:39:55.560 INFO  Mutect2 - ------------------------------------------------------------
09:39:55.560 INFO  Mutect2 - HTSJDK Version: 2.22.0
09:39:55.561 INFO  Mutect2 - Picard Version: 2.22.8
09:39:55.561 INFO  Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:39:55.561 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:39:55.561 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:39:55.561 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:39:55.561 INFO  Mutect2 - Deflater: IntelDeflater
09:39:55.561 INFO  Mutect2 - Inflater: IntelInflater
09:39:55.561 INFO  Mutect2 - GCS max retries/reopens: 20
09:39:55.561 INFO  Mutect2 - Requester pays: disabled
09:39:55.561 INFO  Mutect2 - Initializing engine
09:39:56.014 INFO  FeatureManager - Using codec BEDCodec to read file file:///home/proj/stage/cancer/reference/target_capture_bed/production/balsamic/gicfdna_3.1_hg19_design.bed
09:39:56.024 INFO  IntervalArgumentCollection - Processing 74592 bp from intervals
09:39:56.032 INFO  Mutect2 - Done initializing engine
09:39:56.044 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/proj/bin/conda/envs/D_UMI_APJ/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
09:39:56.077 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/proj/bin/conda/envs/D_UMI_APJ/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
09:39:56.139 INFO  IntelPairHmm - Using CPU-supported AVX-512 instructions
09:39:56.139 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
09:39:56.139 INFO  IntelPairHmm - Available threads: 36
09:39:56.139 INFO  IntelPairHmm - Requested threads: 4
09:39:56.139 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
09:39:56.146 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
09:39:56.146 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
09:39:56.146 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
09:39:56.148 INFO  Mutect2 - Shutting down engine
[July 3, 2020 9:39:56 AM CEST] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2233991168
htsjdk.samtools.util.RuntimeIOException: File not found: mutect2/concatenated_ACC5611A1_XXXXXX_mutect2_unfiltered_ss_r2.vcf.gz
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:451)
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:415)
    at org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils.createVCFWriter(GATKVariantContextUtils.java:121)
    at org.broadinstitute.hellbender.engine.GATKTool.createVCFWriter(GATKTool.java:887)
    at org.broadinstitute.hellbender.engine.GATKTool.createVCFWriter(GATKTool.java:841)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.onTraversalStart(Mutect2.java:262)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1047)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)
Caused by: java.nio.file.NoSuchFileException: mutect2/concatenated_ACC5611A1_XXXXXX_mutect2_unfiltered_ss_r2.vcf.gz
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
    at java.nio.file.Files.newOutputStream(Files.java:216)
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:447)
    ... 12 more
avalind commented 4 years ago

Does the mutect2 directory exist in your current dir?
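
For context: the trace above fails while opening the output file for writing, which is what a missing `mutect2/` directory in the working directory would look like. A minimal Python sketch of creating the output's parent directory before launching the tool; the helper name `ensure_output_dir` is made up for illustration and is not part of GATK.

```python
from pathlib import Path


def ensure_output_dir(output_path: str) -> Path:
    """Create the parent directory of output_path if it does not exist yet.

    Hypothetical helper: it only prepares the directory, it does not
    create or touch the output file itself.
    """
    parent = Path(output_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


if __name__ == "__main__":
    # Relative paths are resolved against the current working directory,
    # the same way the -O argument in the command above is.
    ensure_output_dir("mutect2/concatenated_ACC5611A1_XXXXXX_mutect2_unfiltered_ss_r2.vcf.gz")
```

Run from the same working directory as the `gatk` command, this has the same effect as `mkdir -p mutect2`.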

ashwini06 commented 4 years ago

Mutect2 is installed in a conda environment, and my working directory is different from that path.

fleharty commented 4 years ago

@ashwini06 Could you post the entire command line you are using? Some of it appears to have been cut off.

fleharty commented 4 years ago

@ashwini06 Following up on this to see if you are still experiencing problems. If so, could you post the entire command line?

ashwini06 commented 4 years ago

@fleharty : Thanks for the followup. Sorry I missed your previous reply. Yes, the problem with mutect2 still exists.

gatk4 exists in my conda environment path

$conda list | grep 'gatk'
gatk4                     4.1.8.0          py38h37ae868_0    bioconda

Here is my full command-line

gatk Mutect2 --reference /home/proj/stage/cancer/reference/GRCh37/genome/human_g1k_v37_decoy.fasta --input consensus/concatenated_ACC5611A1_XXXXXX_consensusalign_ds.bam --output mutect2/concatenated_ACC5611A1_XXXXXX_mutect2_unfiltered_ds.vcf.gz

mutect2_err
fleharty commented 4 years ago

@avalind This appears to be a different error from the one you were previously encountering. The current error indicates that there is something wrong with your bam: the length of the insertion qualities does not appear to match the read length.

Is there a way that you can share your bam?

Also, are you sure that you intend to have insertion and deletion qualities? This is something we haven't been using for a few years now.
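
A quick way to look for this kind of mismatch is to compare the length of each quality tag against the length of SEQ. The following is a minimal sketch, assuming the insertion and deletion qualities are stored as `BI` and `BD` string tags (the tags named later in this thread) and that records are available as SAM text lines, for example from `samtools view`. The function name is hypothetical.

```python
def tag_length_mismatches(sam_line, tags=("BI", "BD")):
    """Return {tag: (tag_length, read_length)} for tags whose length differs from SEQ.

    Assumes a tab-separated SAM alignment line: SEQ is the 10th column and
    optional fields (TAG:TYPE:VALUE) start at the 12th column.
    """
    fields = sam_line.rstrip("\n").split("\t")
    seq = fields[9]
    read_len = 0 if seq == "*" else len(seq)

    mismatches = {}
    for field in fields[11:]:
        name, value_type, value = field.split(":", 2)
        if name in tags and value_type == "Z" and len(value) != read_len:
            mismatches[name] = (len(value), read_len)
    return mismatches


if __name__ == "__main__":
    import sys

    # Example use: samtools view input.bam | python this_script.py
    for line in sys.stdin:
        if line.startswith("@"):
            continue  # skip header lines
        bad = tag_length_mismatches(line)
        if bad:
            print(line.split("\t", 1)[0], bad)
```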

ashwini06 commented 4 years ago

@fleharty : You can download the bam file using the shared link.

https://ki.box.com/s/b9fe0854eccclz85vvkktd2qfqquyq71

"Also, are you sure that you intend to have insertion and deletion qualities, this is something we haven't been using for a few years now."

In my workflow, these bam files were generated using sentieon bwa-mem with the default options. Are there any suggestions on how to run mutect2 successfully on this bam file?

fleharty commented 4 years ago

@ashwini06

This bam appears to be malformed and it fails Picard ValidateSamFile. I think you'll need to examine the earlier stages of your pipeline that produce your bam to ensure you get a correctly formed bam. I'm going to close this ticket now since this doesn't appear to be an issue with Mutect2.

(base) wm462-624:Downloads fleharty$ java -jar $PICARD ValidateSamFile I=concatenated_ACC5611A1_XXXXXX_consensusalign_ds.bam
INFO 2020-07-14 11:25:52 ValidateSamFile

** NOTE: Picard's command line syntax is changing.
** For more information, please see:
** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
** The command line looks like this in the new syntax:
** ValidateSamFile -I concatenated_ACC5611A1_XXXXXX_consensusalign_ds.bam

11:25:52.673 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/fleharty/resources/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Tue Jul 14 11:25:52 EDT 2020] ValidateSamFile INPUT=concatenated_ACC5611A1_XXXXXX_consensusalign_ds.bam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 SKIP_MATE_VALIDATION=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Jul 14 11:25:52 EDT 2020] Executing as fleharty@wm462-624 on Mac OS X 10.15.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.4-SNAPSHOT
WARNING 2020-07-14 11:25:52 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
ERROR: Record 18321, Read name UMI-ATT-GAA-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 26312, Read name UMI-CCT-TTC-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 70755, Read name UMI-CAG-GGA-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 145082, Read name UMI-AAC-ATG-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 181500, Read name UMI-ACT-CTT-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186837, Read name UMI-CAA-CTC-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186862, Read name UMI-CGC-GCC-0, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186904, Read name UMI-AGG-GTC-0, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186919, Read name UMI-CGC-TGC-0, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186947, Read name UMI-TAA-TAG-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186970, Read name UMI-GAG-GCC-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186972, Read name UMI-TAT-TTC-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186985, Read name UMI-ACG-TAA-6, Zero-length read without FZ, CS or CQ tag
ERROR: Record 186995, Read name UMI-CTT-GCA-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187006, Read name UMI-CTA-GGG-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187037, Read name UMI-AGT-CTG-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187061, Read name UMI-CAT-GGT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187074, Read name UMI-AAA-CGT-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187110, Read name UMI-ACG-TAG-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187121, Read name UMI-CCG-GCC-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187154, Read name UMI-CAA-CTG-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187181, Read name UMI-CGG-GAG-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 187209, Read name UMI-CAA-GTT-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 279812, Read name UMI-ACT-GGT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 327672, Read name UMI-AGT-CGG-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 367457, Read name UMI-GGA-TTA-6, Zero-length read without FZ, CS or CQ tag
ERROR: Record 441607, Read name UMI-AGA-GTC-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 481504, Read name UMI-AAC-TCT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 481532, Read name UMI-AAT-CAA-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 481722, Read name UMI-ATA-ATT-10, Zero-length read without FZ, CS or CQ tag
ERROR: Record 481989, Read name UMI-CGA-CTA-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 482114, Read name UMI-GAG-TAA-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 482150, Read name UMI-GCC-GTA-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 482210, Read name UMI-GGT-TCC-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 482222, Read name UMI-GTA-GTT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 482251, Read name UMI-GTT-TAC-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 541693, Read name UMI-AGG-GAG-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 763643, Read name UMI-GAG-TAT-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 763881, Read name UMI-AGC-TTT-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 764724, Read name UMI-AAT-ATA-14, Zero-length read without FZ, CS or CQ tag
ERROR: Record 764749, Read name UMI-GCT-GTG-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 764766, Read name UMI-AGC-TAG-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 764858, Read name UMI-AGA-GGT-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 764950, Read name UMI-CTT-GCC-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765124, Read name UMI-CGG-TGT-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765139, Read name UMI-GGA-GTC-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765157, Read name UMI-ATA-CTC-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765213, Read name UMI-AGC-TCT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765249, Read name UMI-AAG-GAT-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765281, Read name UMI-AAG-ACT-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765385, Read name UMI-CGA-CGT-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765535, Read name UMI-GGG-TTG-10, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765582, Read name UMI-ATG-TAA-6, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765607, Read name UMI-CCG-CTA-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765620, Read name UMI-AAA-ATT-16, Zero-length read without FZ, CS or CQ tag
ERROR: Record 765717, Read name UMI-AGG-TAT-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 766523, Read name UMI-GAA-GGA-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 822437, Read name UMI-AGA-CCT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 936121, Read name UMI-CGA-TTT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 964359, Read name UMI-ACT-TAA-16, Zero-length read without FZ, CS or CQ tag
ERROR: Record 965939, Read name UMI-GCA-GTT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 965956, Read name UMI-AAA-ATA-37, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966315, Read name UMI-CTC-GAG-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966349, Read name UMI-ACT-GTT-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966385, Read name UMI-ATT-GCA-10, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966397, Read name UMI-ACC-CGG-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966402, Read name UMI-CAG-TGT-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966417, Read name UMI-CCG-CCT-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966450, Read name UMI-CCC-GAT-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966462, Read name UMI-CCG-TCT-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966487, Read name UMI-GAT-GTT-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966491, Read name UMI-GTG-TTG-3, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966501, Read name UMI-AGA-ATG-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966509, Read name UMI-AGT-GGT-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966514, Read name UMI-ATC-GGA-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966517, Read name UMI-GAT-TGA-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966538, Read name UMI-ATA-GGG-23, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966542, Read name UMI-GTG-TAG-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966591, Read name UMI-CCG-TAT-6, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966596, Read name UMI-GTT-GTT-3-D2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966613, Read name UMI-ACC-GAC-1, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966616, Read name UMI-ACG-TGG-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966618, Read name UMI-ACT-GGG-11, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966620, Read name UMI-ACT-GGG-12, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966627, Read name UMI-GGC-TGT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966674, Read name UMI-CCT-GTC-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966699, Read name UMI-CCG-TGA-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966722, Read name UMI-AGG-TGT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966742, Read name UMI-CCG-TCA-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966752, Read name UMI-GAA-GAT-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966784, Read name UMI-CCT-TAT-12, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966875, Read name UMI-AGG-GGG-10, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966887, Read name UMI-AGG-CCG-5, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966916, Read name UMI-GCT-TCG-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966939, Read name UMI-CAA-TGT-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966989, Read name UMI-GAA-TCA-7, Zero-length read without FZ, CS or CQ tag
ERROR: Record 966991, Read name UMI-TAG-TGT-2, Zero-length read without FZ, CS or CQ tag
ERROR: Record 967245, Read name UMI-AAG-ATT-8, Zero-length read without FZ, CS or CQ tag
ERROR: Record 975151, Read name UMI-ACT-CCC-4, Zero-length read without FZ, CS or CQ tag
ERROR: Record 1064783, Read name UMI-GGA-GGT-6, Zero-length read without FZ, CS or CQ tag
Maximum output of [100] errors reached.
[Tue Jul 14 11:25:59 EDT 2020] picard.sam.ValidateSamFile done. Elapsed time: 0.12 minutes.
Runtime.totalMemory()=1450180608
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
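
A long list like this is easier to compare between Picard versions once it is summarised. Below is a small Python sketch that pulls the record number, read name and message out of `ERROR:` entries in the format shown above and counts them per message. The function name and the regular expression are assumptions written against this output only; they are not part of Picard.

```python
import re
from collections import Counter

# Matches entries such as:
#   ERROR: Record 18321, Read name UMI-ATT-GAA-2, Zero-length read without FZ, CS or CQ tag
# It tolerates entries that run together on one line or wrap across lines.
ERROR_ENTRY = re.compile(
    r"ERROR: Record (\d+), Read name\s+([^,\s]+),\s+(.+?)"
    r"(?=\s+ERROR:|\s+Maximum output|\s*$)",
    re.S,
)


def summarize_validation_errors(log_text):
    """Return (entries, counts): parsed (record, read_name, message) tuples and a per-message Counter."""
    entries = []
    for record, read_name, message in ERROR_ENTRY.findall(log_text):
        # Collapse any line wrapping inside the message.
        entries.append((int(record), read_name, " ".join(message.split())))
    counts = Counter(message for _, _, message in entries)
    return entries, counts


if __name__ == "__main__":
    import sys

    parsed, per_message = summarize_validation_errors(sys.stdin.read())
    for message, count in per_message.most_common():
        print(count, message)
```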

fleharty commented 4 years ago

@avalind I got an e-mail saying that you ran picard and had no errors, but I don't see that comment here.

avalind commented 4 years ago

@fleharty I think you meant to tag @ashwini06 (the creator of this issue). I also received that email, maybe @ashwini06 deleted the comment shortly after posting it?

ashwini06 commented 4 years ago

@fleharty @avalind Sorry, something happened with my previous message. But what I wrote previously was that I couldn't reproduce the same error message using Picard ValidateSamFile.

I tried validating my bam file and I don't see any errors. Even samtools flagstat works fine on my bam file. Please find the attached screenshots.

picard flagstst

Do you still think my bam file is malformed?

PS: @fleharty used Picard version 2.20.4-SNAPSHOT to run ValidateSamFile, whereas I used v2.23.2.

avalind commented 4 years ago

Bumping this since I ran into the same error while helping QC a colleague's data. Running GATK 4.1.8.1 produces the following:

https://www.dropbox.com/s/2uleabl53dmg9y3/Screenshot%202020-07-28%2000.35.45.png

This is on targeted capture data (Twist custom capture) run through our core facility's Sentieon pipeline, using the 'consensus' reads mapped to 1kg_grch37. Using the raw reads works fine. I'm not very familiar with Sentieon's pipelines, but the steps to generate the UMI consensus reads are described at https://support.sentieon.com/appnotes/umi/.

At first I thought the discrepancy between @fleharty's ValidateSamFile results and yours, @ashwini06, might be because the newer version of Picard uses an updated version of htsjdk (v2.23.0), but that is the same version of htsjdk that's included in GATK 4.1.8.1, so it seems unlikely. Walking through the commits between Picard 2.22.8 (the one bundled with GATK 4.1.8.1) and 2.23.2 doesn't (at least at first glance for me) show any commits changing code that could explain the difference in behaviour.

avalind commented 4 years ago

After more digging around, it seems that in the case of partial alignment (i.e. hard-clipped bases) the BD and BI tags, which Sentieon just copies from the consensus fastq, aren't trimmed to the actual length of the aligned sequence. They are therefore too long, and this is what causes the problem.

As these are non-standard tags the SAM/BAM format specification doesn't say anything on whether their length must equal the aligned segment of bases, but it clearly doesn't make any sense to have quality data on bases that are not part of the alignment (= hard clipped), so IMHO the solution here would be for Sentieon to fix their tool.

I've written a small utility that trims the BD and BI tags (based on the CIGAR-string) to have the same length as the actual aligned segment of the read, https://github.com/avalind/doppelganger.
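
To illustrate the idea (this is a sketch, not the code of the linked utility): the number of hard-clipped bases at each end can be read from the CIGAR string, and the same number of characters dropped from each end of the tag value. The function names are hypothetical, and the sketch assumes the BD/BI values are stored in the same orientation as SEQ, which may not hold for reverse-strand reads.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def hard_clip_lengths(cigar):
    """Return (leading, trailing) hard-clip lengths from a CIGAR string."""
    ops = CIGAR_OP.findall(cigar)
    if not ops:
        return 0, 0  # e.g. "*" for an unmapped read
    leading = int(ops[0][0]) if ops[0][1] == "H" else 0
    trailing = int(ops[-1][0]) if len(ops) > 1 and ops[-1][1] == "H" else 0
    return leading, trailing


def trim_tag_to_read(value, cigar, seq_len):
    """Drop the hard-clipped ends from a per-base tag value such as BD or BI.

    The value is only trimmed when its length equals seq_len plus the
    hard-clipped bases; anything else is returned unchanged so that
    already-correct tags are not shortened by mistake.
    Assumption: the tag is in the same orientation as SEQ.
    """
    leading, trailing = hard_clip_lengths(cigar)
    if len(value) != seq_len + leading + trailing:
        return value
    return value[leading:len(value) - trailing]
```

For a read with CIGAR `2H4M3H` and a 4-base SEQ, a 9-character BD value would be cut down to its middle 4 characters.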