MarkDuplicates results in Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()"

konopinski commented 4 months ago

Instructions

The github issue tracker is for bug reports, feature requests, and API documentation requests. General questions about how to use the GATK, how to interpret the output, etc. should be asked on the official support forum.

Search the existing github issues to see if your issue (or something similar) has already been reported. If the issue already exists, you may comment there to inquire about the progress.
Determine whether your issue is a bug report, a feature request, or a documentation request (for tool/class javadoc only -- for forum docs please post there)
Consider if your "issue" is better addressed on the GATK forum: http://gatkforums.broadinstitute.org/gatk Post there if you have questions about expected tool behavior, output format, unexpected results, or generally any question that does not fit into the categories above
Use a concise yet descriptive title
Choose the corresponding template block below and fill it in, replacing or deleting text in italics (surrounded by _) as appropriate
Delete the other template blocks and this header.

Bug Report

Affected tool(s) or class(es)

Tool/class name(s), special parameters? MarkDuplicates

Affected version(s)

[ ] Latest public release version [version?] 4.6.0.0 GATK and Picard 3.2.0
[ ] Latest master branch as of [date of test?] 3 Jul 2023

Description

Describe the problem below. Provide screenshots , stacktrace , logs where appropriate. I'm trying to use gatk for finding snps in exome capture project. I get an error when trying to use MarkDuplicates - I tried using it from picard and from gatk. The screen output is:

picard MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt
INFO    2024-07-03 15:25:31     MarkDuplicates

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I WA02_i5-537_i7-98_S11819_L004.bam -O test.dup.bam -M marked_dup_metrics.txt
**********

15:25:31.262 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/picard/build/libs/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Jul 03 15:25:31 CEST 2024] MarkDuplicates INPUT=[WA02_i5-537_i7-98_S11819_L004.bam] OUTPUT=test.dup.bam METRICS_FILE=marked_dup_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_DUP_STRATEGY=FLOW_QUALITY_SUM_STRATEGY USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 UNPAIRED_START_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Jul 03 15:25:31 CEST 2024] Executing as qnick@Barbus on Linux 6.5.0-41-generic amd64; OpenJDK 64-Bit Server VM 17.0.11+9-Ubuntu-122.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 3.2.0-1-gd8d87c971-SNAPSHOT
INFO    2024-07-03 15:25:31     MarkDuplicates  Start of doWork freeMemory: 377343072; totalMemory: 402653184; maxMemory: 32178700288
INFO    2024-07-03 15:25:31     MarkDuplicates  Reading input file and constructing read end information.
INFO    2024-07-03 15:25:31     MarkDuplicates  Will retain up to 116589493 data points before spilling to disk.
[Wed Jul 03 15:25:35 CEST 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=6861881344
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)

or from gatk

gatk MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt
Using GATK jar /opt/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt
INFO    2024-07-03 15:26:21     MarkDuplicates

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I WA02_i5-537_i7-98_S11819_L004.bam -O test.dup.bam -M marked_dup_metrics.txt
**********

15:26:21.393 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Jul 03 15:26:21 CEST 2024] MarkDuplicates INPUT=[WA02_i5-537_i7-98_S11819_L004.bam] OUTPUT=test.dup.bam METRICS_FILE=marked_dup_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_DUP_STRATEGY=FLOW_QUALITY_SUM_STRATEGY USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 UNPAIRED_START_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=2 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Jul 03 15:26:21 CEST 2024] Executing as qnick@Barbus on Linux 6.5.0-41-generic amd64; OpenJDK 64-Bit Server VM 17.0.11+9-Ubuntu-122.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 4.6.0.0
INFO    2024-07-03 15:26:21     MarkDuplicates  Start of doWork freeMemory: 189357416; totalMemory: 234881024; maxMemory: 32178700288
INFO    2024-07-03 15:26:21     MarkDuplicates  Reading input file and constructing read end information.
INFO    2024-07-03 15:26:21     MarkDuplicates  Will retain up to 116589493 data points before spilling to disk.
[Wed Jul 03 15:26:25 CEST 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=3774873600
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
        at org.broadinstitute.hellbender.Main.main(Main.java:306)

BTW I'm not sure what is "stacktrace" so I do not add one.

Steps to reproduce

bam files obtained as follows:

@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.13 CL:samtools view -S -bh /mnt/Dane/Szop/Exome/aligned/sam/WA02_i5-537_i7-98_S11819_L004.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.20 CL:samtools view -h WA02_i5-537_i7-98_S11819_L004.bam

commands were gatk MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt picard MarkDuplicates I=WA02_i5-537_i7-98_S11819_L004.bam O=test.dup.bam M=marked_dup_metrics.txt (in the latter picard is alias for "java -jar /path/to/picard.jar"

Expected behavior

I expect the normal action of the function - i.e. output file with marked duplicates

Actual behavior

See above

kockan commented 4 months ago

Could you check your BAM header to see if it contains any read groups? If not, you might need to use AddOrReplaceReadGroups

konopinski commented 4 months ago

That was my case. If anyone encounters the same problem the full solution is: Check if there's @RG (read group) line in the problematic .bam header:

samtools view -H <your bam file> | grep '@RG'

If nothing is found, add 'read group' by e.g.

samtools addreplacerg -r "@RG\tID:ReadGroup1\tSM:SampleName\tPL:Illumina\tLB:Library.fa" -o <output.bam> <input.bam>

You can set ReadGroup1 and SampleName and Library.fa to something really existing in your dataset (lane number, real sample name and the name of the fasta file). I am not sure about Illumina but it is true for my case so I left it as it is.

By the way - this is my call to anyone giving advice on bioinformatic tools. I really appreciate your time and effort - I really do. But think of your answer as if you were talking to a regular high-school student. Most of us here know Linux and programming as much as it is necessary to answer our biological questions. We do not know most of the IT guy's slang. Please, be more patient and give us more detailed solutions because otherwise you waste your time answering and we waste our time reading it. For us it's biology that matters, and all the bioinformatic tools are just the tools. In the end, you do not need to know how the processor works to use the computer.

broadinstitute / gatk