broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GetPileupSummaries Permanently Stalled but High CPU Usage #8654

Open DarioS opened 9 months ago

DarioS commented 9 months ago

CPU usage was high.

                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
hugemem-ex
105958574 R ds6924 hm82 getpileu  99  07:23:01  16:00:00 1120GB 1120GB 1400GB    37

However, it never proceeds past the first interval.

08:21:42.921 INFO  GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.4.0.0
08:21:42.921 INFO  GetPileupSummaries - Start Date/Time: January 12, 2024 at 8:21:42 AM GMT+10:00
08:21:42.927 INFO  GetPileupSummaries - Initializing engine
08:55:35.361 INFO  IntervalArgumentCollection - Processing 326649654 bp from intervals
08:57:45.036 INFO  GetPileupSummaries - Done initializing engine
08:57:45.101 INFO  ProgressMeter - Starting traversal
08:57:45.106 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
(END)

Some log files end with an out-of-memory error, but only after many hours and with no intervals processed.

08:34:26.243 INFO  ProgressMeter - Starting traversal
08:34:26.244 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
15:35:01.977 INFO  GetPileupSummaries - Shutting down engine
[January 12, 2024 at 3:35:02 PM GMT+10:00] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 433.32 minutes.
Runtime.totalMemory()=31136546816
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

droazen commented 9 months ago

@DarioS How much memory are you providing to Java via the -Xmx option, and how much physical memory do you have available? You can see how to pass the -Xmx option in to GATK here: https://github.com/broadinstitute/gatk?tab=readme-ov-file#jvmoptions

DarioS commented 9 months ago

-Xmx52g was used. Compute node has 1.5 TB physical RAM. I use af-only-gnomad.hg38.vcf.gz for -V and -L.

droazen commented 9 months ago

@DarioS You could try increasing the size of the Java heap (say, doubling it to 104g). Does your bam/cram have extremely high depth?

DarioS commented 9 months ago

I copied the 60× BAM file to an interactive Linux server with 768 GB of physical RAM and eighty cores, and used version 4.5.0.0.

%Cpu(s):  1.3 us,  0.0 sy,  0.1 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem :    754.5 total,     52.1 free,    107.3 used,    600.3 buff/cache     
GiB Swap:    931.3 total,    924.9 free,      6.4 used.    647.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                 
 171365 dario     20   0   35.0g  31.3g  23040 S 100.0   4.1  32:18.12 java   

I removed -Xmx and, watching the process with top, saw that it stays consistently at about 32 GB. So, -Xmx is irrelevant to the problem.

12:15:04.531 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
12:57:32.208 INFO  GetPileupSummaries - Shutting down engine
[January 13, 2024 at 12:57:32 PM AEDT] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 50.36 minutes.
Runtime.totalMemory()=20753416192
java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects

What does "reallocation of scalar replaced objects" mean? I don't think it could possibly have run out of memory.
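
A process that stays near 32 GB after -Xmx is removed is not necessarily unconstrained: when no -Xmx is given, the JVM still chooses a default maximum heap. One way to check the ceiling actually in effect is to ask the JVM directly. The following is a minimal sketch, not part of GATK; the class name `HeapReport` and its helper methods are hypothetical, and only the standard `Runtime` methods `maxMemory()`, `totalMemory()` and `freeMemory()` are relied upon.

```java
// Minimal sketch: report the heap limits of the JVM that runs this class.
// HeapReport, toGiB and describe are hypothetical names used for illustration.
class HeapReport {
    /** Converts a byte count to GiB (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    /** Formats one labelled line with the size in GiB to two decimals. */
    static String describe(String label, long bytes) {
        return String.format(java.util.Locale.ROOT, "%s: %.2f GiB", label, toGiB(bytes));
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the ceiling the heap may grow to (set by -Xmx or a JVM default).
        System.out.println(describe("max heap", rt.maxMemory()));
        // totalMemory() is the heap currently reserved, the figure GATK prints on exit.
        System.out.println(describe("total heap", rt.totalMemory()));
        System.out.println(describe("free heap", rt.freeMemory()));
    }
}
```

Running it on the same machine with and without an explicit limit (for example `java -Xmx52g HeapReport.java` versus `java HeapReport.java`, assuming Java 11 or later for single-file launch) shows whether the default ceiling is close to the ~32 GB seen in top. For reference, the Runtime.totalMemory()=31136546816 value in the first log is about 29 GiB.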

amarinderthind commented 9 months ago

I am in a similar boat. -Xmx has a small default value. Specifying a 448 GB limit shows that this module is memory-inefficient.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 172833 thind     20   0  468.7g 378.7g  31360 S  99.9  50.2  29:16.31 java

The analysis dies a few seconds later because GATK tries to create an impossibly large Java array.

org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 25.63 minutes.
Runtime.totalMemory()=481036337152
java.lang.OutOfMemoryError: Required array length 2147483640 + 16 is too large
        at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
        at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)

I can independently reproduce Dario's problem on the same Linux server.
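
The "Required array length 2147483640 + 16 is too large" message is not about available RAM: Java arrays are indexed by `int`, so no array can hold more than 2147483647 elements regardless of heap size. The sketch below illustrates the kind of overflow check that produces such a message. It is a simplified illustration under that assumption, not the JDK's `ArraysSupport` source, and the class and method names are hypothetical.

```java
// Sketch of an int-overflow check behind "Required array length N + M is too large".
// ArrayGrowthCheck and requiredLength are hypothetical names, not JDK or GATK APIs.
class ArrayGrowthCheck {
    /**
     * Returns oldLength + minGrowth, or throws OutOfMemoryError when the sum
     * does not fit in an int. Both arguments are assumed to be non-negative.
     */
    static int requiredLength(int oldLength, int minGrowth) {
        int minLength = oldLength + minGrowth; // wraps to a negative value on overflow
        if (minLength < 0) {
            throw new OutOfMemoryError(
                "Required array length " + oldLength + " + " + minGrowth + " is too large");
        }
        return minLength;
    }

    public static void main(String[] args) {
        // A small array can grow without trouble.
        System.out.println(requiredLength(1_000, 16));
        try {
            // The values from the log: the sum exceeds Integer.MAX_VALUE.
            requiredLength(2_147_483_640, 16);
        } catch (OutOfMemoryError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the limit comes from the `int` index type, raising -Xmx further cannot avoid this particular failure; the single buffer being grown would have to stay below roughly 2^31 elements.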