broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GenotypeGVCFs memory issues on GATK 4.6.0.0 #8918

Open jin0008 opened 1 month ago

jin0008 commented 1 month ago

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs

Affected version(s)

4.6.0.0

Description

When I ran GenotypeGVCFs on a GenomicsDB workspace of 420 samples, the process was interrupted due to a significant memory issue: it consumed memory continuously. I ran the same process in 4.5.0.0 and confirmed it works fine.

gokalpcelik commented 1 month ago

Can you provide the logs that show the error message?

jin0008 commented 1 month ago

There are no error messages; the process was interrupted without any. I attached a screenshot, along with the chr14 variant calling run (completed) and the chr14 variant calling run (interrupted). In the system monitor, when I use GATK 4.6.0.0, the processes eat up memory continuously, and when they reach 512 GB the run is interrupted. I tried running the process on only 2-3 chromosomes and found that it completed on chr14 but was interrupted on the other two (selected with -L intervals). When I rolled back to GATK 4.5.0.0 the process was normal, and I can run the GenotypeGVCFs command on all chromosomes simultaneously.

My machine is a Dell 7865 workstation with 512 GB of memory and 64 cores (AMD Threadripper 5995WX). Thanks, Jinu Han
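A per-chromosome invocation of this sort looks roughly like the following (an illustrative sketch only; the reference, workspace path, and output names are placeholders, not the exact command used):

```bash
# Hypothetical per-chromosome GenotypeGVCFs run against a GenomicsDB workspace;
# all paths and file names are placeholders.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb:///path/to/genomicsdb_chr14 \
    -L chr14 \
    -O chr14.genotyped.vcf.gz
```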

gokalpcelik commented 1 month ago

Can you provide more details on what operating system you are using and other related information, such as the Java version?

Even if the process is interrupted by the system, there should be a Java segfault message thrown by the process at some point. Did you observe any files with names like ERR around the output file?

jin0008 commented 1 month ago

Hi, the operating system is Ubuntu 20.04 and the Java version is OpenJDK 17.0.11. Whenever a GATK Best Practices step was interrupted before, I could always see error messages, but this time the process was interrupted without giving any, which is quite weird. I checked this on several other chromosomes. My callset has about 430 samples. I could run GenotypeGVCFs in GATK 4.5.0.0 without any problem, but in GATK 4.6.0.0 the process succeeded on only 3-4 chromosomes (the smaller ones, I think) and was interrupted at an incomplete stage on the rest. I could not find any ERR files in the folder. Thanks, Jinu Han

gokalpcelik commented 1 month ago

Can you tell us what heap size you set for this task? (-Xmx? -Xms?)
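For reference, the heap flags are passed to GATK through the wrapper's --java-options argument; the values below are illustrative only, not a recommendation for this dataset:

```bash
# Illustrative only: JVM heap settings are forwarded via --java-options.
# All paths and file names are placeholders.
gatk --java-options "-Xms4g -Xmx8g" GenotypeGVCFs \
    -R reference.fasta \
    -V gendb:///path/to/genomicsdb_chr14 \
    -O chr14.genotyped.vcf.gz
```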

icemduru commented 1 month ago

I have a similar issue. Weirdly, -Xmx does not help.

gokalpcelik commented 1 month ago

@icemduru Can you provide more details on your issue? How many samples do you have? How did you combine them and what are your command lines for this process? Can you provide more details on the system that you are running these commands on?

GenotypeGVCFs is not known to have memory-leak issues. Our tests indicated that it only needs around 4-6 GB of total memory to genotype 120 whole-genome samples (per contig).

icemduru commented 1 month ago

Thanks for the reply. I have 370 samples. I ran HaplotypeCaller on each of them, then ran GenomicsDBImport for each chromosome (it is a plant genome, about 420 Mb in total genome size), and then tried to run GenotypeGVCFs for each chromosome. I attached the log file for chr1: slurm-22616776.out_text.txt
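This follows the standard per-sample GVCF workflow; a minimal sketch of the three steps (sample names, paths, and the sample map file are hypothetical):

```bash
# Hypothetical sketch of the workflow described above; all paths are placeholders.
# 1. Per-sample GVCFs
gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF

# 2. Per-chromosome GenomicsDB workspace (sample_map.txt lists "sample<TAB>path/to/gvcf")
gatk GenomicsDBImport \
    --genomicsdb-workspace-path genomicsdb_chr1 \
    --sample-name-map sample_map.txt \
    -L chr1

# 3. Per-chromosome joint genotyping
gatk GenotypeGVCFs -R reference.fasta -V gendb://genomicsdb_chr1 -L chr1 -O chr1.vcf.gz
```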

gokalpcelik commented 1 month ago

Hi @icemduru Looks like your Slurm workload manager is configured with a limit of 48 GB of maximum process memory per execution. Your Java instance is set with -Xmx45G, which covers most of this limit and leaves only a small amount of memory for the native GenomicsDB library. Native libraries allocate memory outside the Java heap, so it is better to set your -Xmx to a more sensible size of 8-12 GB and leave the rest of the memory space for the native library to use.

Keep in mind that this memory limit on Slurm could be set per user rather than per task, so you may need to run a single contig at a time, or perhaps two simultaneously; otherwise Slurm may interfere with all the tasks and cancel all your jobs.

One final reminder: we strongly recommend setting the temporary directory to somewhere other than /tmp. The Slurm workload manager interferes with this location and sometimes causes premature termination of GATK processes because the extracted native library and accessory files get deleted.

I hope this helps.
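Putting that advice together, the suggested invocation looks roughly like the following (a sketch only; paths are placeholders and the exact heap size should be tuned to the node):

```bash
# Sketch: smaller heap so the native GenomicsDB library has headroom,
# plus a temporary directory outside /tmp. All paths are placeholders.
gatk --java-options "-Xmx10g" GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://genomicsdb_chr1 \
    -L chr1 \
    --tmp-dir /scratch/username/gatk_tmp \
    -O chr1.vcf.gz
```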

icemduru commented 1 month ago

Thank you for your help, but unfortunately it didn't resolve the issue. I've already tried allocating 10GB of memory using the -Xmx10g flag and redirecting the temporary directory away from /tmp. However, GATK is still attempting to consume more than 48GB of RAM, resulting in the termination of my run. slurm-22680938.out_text.txt

gokalpcelik commented 1 month ago

Hi again. Did you add the --consolidate true parameter to GenomicsDBImport during the import stage? This step collapses each imported layer into a single layer, which prevents tools from opening too many files at once, though it may take some time at the end of the import. It also reduces the amount of bookkeeping the genotyper has to do.
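For illustration, the parameter is added to the import command like this (a minimal sketch; the workspace, sample map, and interval names are placeholders):

```bash
# Sketch: GenomicsDBImport with layer consolidation enabled; paths are placeholders.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path genomicsdb_chr1 \
    --sample-name-map sample_map.txt \
    -L chr1 \
    --consolidate true
```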

icemduru commented 3 weeks ago

Hi, thanks for the suggestion. I used the --consolidate true parameter with GenomicsDBImport during the import stage, but it did not help. However, I solved my problem by using large-memory machines. For future reference, the required memory was 95.11 GB for the 370-sample dataset using -Xmx8G and --disable-bam-index-caching true.
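For completeness, a sketch of the kind of command those final settings imply (paths and the interval are placeholders; the actual run used large-memory nodes as noted above):

```bash
# Sketch of the configuration reported to work (roughly 95 GB resident for 370 samples);
# all paths and names are placeholders.
gatk --java-options "-Xmx8G" GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://genomicsdb_chr1 \
    -L chr1 \
    --disable-bam-index-caching true \
    -O chr1.vcf.gz
```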