broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 592 forks source link

MarkDuplicates results in Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" #8904

Closed konopinski closed 4 months ago

konopinski commented 4 months ago

Instructions

The github issue tracker is for bug reports, feature requests, and API documentation requests. General questions about how to use the GATK, how to interpret the output, etc. should be asked on the official support forum.


Bug Report

Affected tool(s) or class(es)

Tool/class name(s), special parameters? MarkDuplicates

Affected version(s)

Description

Describe the problem below. Provide screenshots , stacktrace , logs where appropriate. I'm trying to use gatk for finding snps in exome capture project. I get an error when trying to use MarkDuplicates - I tried using it from picard and from gatk. The screen output is:

picard MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt
INFO    2024-07-03 15:25:31     MarkDuplicates

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I WA02_i5-537_i7-98_S11819_L004.bam -O test.dup.bam -M marked_dup_metrics.txt
**********

15:25:31.262 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/picard/build/libs/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Jul 03 15:25:31 CEST 2024] MarkDuplicates INPUT=[WA02_i5-537_i7-98_S11819_L004.bam] OUTPUT=test.dup.bam METRICS_FILE=marked_dup_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_DUP_STRATEGY=FLOW_QUALITY_SUM_STRATEGY USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 UNPAIRED_START_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Jul 03 15:25:31 CEST 2024] Executing as qnick@Barbus on Linux 6.5.0-41-generic amd64; OpenJDK 64-Bit Server VM 17.0.11+9-Ubuntu-122.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 3.2.0-1-gd8d87c971-SNAPSHOT
INFO    2024-07-03 15:25:31     MarkDuplicates  Start of doWork freeMemory: 377343072; totalMemory: 402653184; maxMemory: 32178700288
INFO    2024-07-03 15:25:31     MarkDuplicates  Reading input file and constructing read end information.
INFO    2024-07-03 15:25:31     MarkDuplicates  Will retain up to 116589493 data points before spilling to disk.
[Wed Jul 03 15:25:35 CEST 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=6861881344
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)

or from gatk

gatk MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt
Using GATK jar /opt/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt
INFO    2024-07-03 15:26:21     MarkDuplicates

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I WA02_i5-537_i7-98_S11819_L004.bam -O test.dup.bam -M marked_dup_metrics.txt
**********

15:26:21.393 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Jul 03 15:26:21 CEST 2024] MarkDuplicates INPUT=[WA02_i5-537_i7-98_S11819_L004.bam] OUTPUT=test.dup.bam METRICS_FILE=marked_dup_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_DUP_STRATEGY=FLOW_QUALITY_SUM_STRATEGY USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 UNPAIRED_START_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=2 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Jul 03 15:26:21 CEST 2024] Executing as qnick@Barbus on Linux 6.5.0-41-generic amd64; OpenJDK 64-Bit Server VM 17.0.11+9-Ubuntu-122.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 4.6.0.0
INFO    2024-07-03 15:26:21     MarkDuplicates  Start of doWork freeMemory: 189357416; totalMemory: 234881024; maxMemory: 32178700288
INFO    2024-07-03 15:26:21     MarkDuplicates  Reading input file and constructing read end information.
INFO    2024-07-03 15:26:21     MarkDuplicates  Will retain up to 116589493 data points before spilling to disk.
[Wed Jul 03 15:26:25 CEST 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=3774873600
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
        at org.broadinstitute.hellbender.Main.main(Main.java:306)

BTW I'm not sure what is "stacktrace" so I do not add one.

Steps to reproduce

bam files obtained as follows:

@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.13 CL:samtools view -S -bh /mnt/Dane/Szop/Exome/aligned/sam/WA02_i5-537_i7-98_S11819_L004.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.20 CL:samtools view -h WA02_i5-537_i7-98_S11819_L004.bam

commands were gatk MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt picard MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt (in the latter picard is alias for "java -jar /path/to/picard.jar"

Expected behavior

I expect the normal action of the function - i.e. output file with marked duplicates

Actual behavior

See above

kockan commented 4 months ago

Could you check your BAM header to see if it contains any read groups? If not, you might need to use AddOrReplaceReadGroups

konopinski commented 4 months ago

That was my case. If anyone encounters the same problem the full solution is: Check if there's @RG (read group) line in the problematic .bam header:

samtools view -H <your bam file> | grep '@RG' 

If nothing is found, add 'read group' by e.g.

samtools addreplacerg -r "@RG\tID:ReadGroup1\tSM:SampleName\tPL:Illumina\tLB:Library.fa" -o <output.bam> <input.bam>

You can set ReadGroup1 and SampleName and Library.fa to something really existing in your dataset (lane number, real sample name and the name of the fasta file). I am not sure about Illumina but it is true for my case so I left it as it is.

By the way - this is my call to anyone giving advice on bioinformatic tools. I really appreciate your time and effort - I really do. But think of your answer as if you were talking to a regular high-school student. Most of us here know Linux and programming as much as it is necessary to answer our biological questions. We do not know most of the IT guy's slang. Please, be more patient and give us more detailed solutions because otherwise you waste your time answering and we waste our time reading it. For us it's biology that matters, and all the bioinformatic tools are just the tools. In the end, you do not need to know how the processor works to use the computer.