broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Unusual CPU Load Spikes for SplitNCigar #7914

Open von1laughing opened 2 years ago

von1laughing commented 2 years ago

Bug Report

Affected tool(s) or class(es)

gatk SplitNCigarReads

Affected version(s)

Description

I produced the BAM files with STAR, setting the MAPQ of uniquely mapped reads to 60 (--outSAMmapqUnique 60). I then marked duplicates with sambamba markdup and ran SplitNCigarReads on the result.

The CPU load for SplitNCigarReads is very high and at times spikes to 2400%. I tried limiting the CPU usage with JVM options such as -XX:ParallelGCThreads=1 and -XX:ConcGCThreads=1, but they don't seem to have any effect. (The CPU usage does sometimes stay at 100%.) I also adjusted the MQ value in STAR to lessen the load on SplitNCigarReads, and tried increasing the read size to reduce I/O time.

Steps to reproduce

STAR

STAR \
--genomeDir ${star_reference_path} \
--runThreadN 16 \
--readFilesIn ${file_1} ${file_2} \
--readFilesCommand "gunzip -c" \
--sjdbOverhang 149 \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 16 \
--outSAMmultNmax 1 \
--outSAMmapqUnique 60 \
--outSAMattrRGline ID:${id} LB:RNASEQ SM:${sample_name} PL:ILLUMINA PU:${platform_unit} PM:${instrument_id} \
--limitBAMsortRAM 50000000000 \
--twopassMode Basic \
--outFileNamePrefix /rawdata/rnaseq/clean/bam/1.

Mark Duplicate

sambamba markdup \
-t 4 \
--tmpdir=/tmp \
--hash-table-size=262144 \
--overflow-list-size=67108864 \
/rawdata/rnaseq/clean/bam/1.Aligned.sortedByCoord.out.bam \
/rawdata/rnaseq/clean/bam/1.aligned.duplicate_marked.sorted.bam

SplitNCigarReads

gatk --java-options "-Djava.io.tmpdir=/tmp -Xmx20G -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" SplitNCigarReads \
-R ${reference_path} \
--tmp-dir /tmp \
-I /rawdata/rnaseq/clean/bam/1.aligned.duplicate_marked.sorted.bam \
-O /rawdata/rnaseq/clean/bam_gatk/1.aligned.duplicate_marked.sorted.bam \
--create-output-bam-md5 TRUE \
--max-reads-in-memory 1000000 \
--skip-mapping-quality-transform TRUE

droazen commented 2 years ago

@von1laughing Can you try running jstack on the running GATK process when the CPU usage is ~2400%, and paste the output here? This will produce a dump of the Java threads. You'll need to provide jstack with the process ID (pid) of the running Java process.
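
For reference, the thread dump requested above could be collected roughly as follows. This is a minimal sketch, not an official GATK procedure: it assumes a JDK is on the PATH (for `jps` and `jstack`), that it is run as the same user as the GATK job, and the helper names `find_java_pid`, `tid_to_nid`, and `dump_threads` are hypothetical. The hex conversion assumes a JDK whose jstack output prints `nid` in hexadecimal (as JDK 8 and 11 do); if yours prints it in decimal, no conversion is needed.

```shell
#!/usr/bin/env bash
# Sketch: capture a Java thread dump from a running GATK tool.
# Helper names are illustrative only; they are not part of GATK or the JDK.

# Print the PID of each running JVM whose main class or arguments
# contain the given string (for example, "SplitNCigarReads").
# jps lists JVMs only, so the gatk launcher script itself is not matched.
find_java_pid() {
    jps -lm | awk -v tool="$1" 'index($0, tool) { print $1 }'
}

# Convert a decimal OS thread ID (as shown by `top -H`) to the
# hexadecimal form used by the nid= field in jstack output.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Write a thread dump for the given PID to a file
# (default name: jstack.<pid>.txt in the current directory).
dump_threads() {
    local pid="$1"
    local out="${2:-jstack.${pid}.txt}"
    jstack "$pid" > "$out"
}

# Example session, to be run while the CPU usage is high:
#   pid=$(find_java_pid SplitNCigarReads | head -n 1)
#   dump_threads "$pid"            # writes jstack.<pid>.txt
#   top -H -b -n 1 -p "$pid"       # per-thread CPU usage for that process
#   tid_to_nid <TID from top>      # then search the dump for nid=<result>
```

Matching the busiest thread IDs from `top -H` against the `nid` values in the dump shows whether the load comes from garbage-collection threads or from somewhere else in the tool.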