broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GenotypeGVCFs reports java.lang.OutOfMemoryError: Java heap space when calling an incrementally imported GenomicsDB #8777

Open LYOKOIIIYYR opened 2 months ago

LYOKOIIIYYR commented 2 months ago

Bug Report

Affected tool(s) or class(es)

gatk GenomicsDBImport GenotypeGVCFs

Affected version(s)

The Genome Analysis Toolkit (GATK) v4.5.0.0

Description

Hi, here is my situation: I am testing the feasibility of incremental GenomicsDB import. I have 400 samples in total for joint calling, and I have no problem running GenomicsDBImport and GenotypeGVCFs directly on all 400 samples. The configuration used is 4c32g for GenomicsDBImport and 2c16g for GenotypeGVCFs. But when I first build a GenomicsDB workspace of 200 samples with GenomicsDBImport (which succeeds), then add the remaining 200 samples with GenomicsDBImport --genomicsdb-update-workspace-path, and finally run GenotypeGVCFs on this incrementally imported GenomicsDB, it fails with GENOMICSDB_TIMER, Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. Here is my code:

gatk --java-options "-Xms8000m -Xmx~{max_mem}m" \
            GenomicsDBImport \
            --tmp-dir $PWD \
            --genomicsdb-workspace-path ~{workspace_dir_name}~{prefix}.~{index} \
            --batch-size 50 \
            -L ~{intervals} \
            --reader-threads 5 \
            --merge-input-intervals \
            --consolidate \
            -V ~{sep = " -V " single_sample_gvcfs}

gatk --java-options "-Xms8000m -Xmx~{max_mem}m" \
            GenomicsDBImport \
            --tmp-dir $PWD \
            --genomicsdb-update-workspace-path ~{workspace_dir_name} \
            --batch-size 50 \
            --reader-threads 5 \
            --merge-input-intervals \
            --consolidate \
            -V ~{sep = " -V " single_sample_gvcfs}

gatk --java-options "-Xms8000m -Xmx~{max_mem}m" \
            GenotypeGVCFs \
            --tmp-dir $PWD \
            -R ~{ref} \
            -O ~{workspace_dir_name}.vcf.gz \
            -G StandardAnnotation \
            --only-output-calls-starting-in-intervals \
            -V gendb://~{workspace_dir_name} \
            -L ~{intervals} \
            --merge-input-intervals \
            -all-sites

I also noticed that, shortly before the error, the number of threads used by GATK increased, but memory usage never exceeded the server's limit. I also reduced --max-alternate-alleles and --genomicsdb-max-alternate-alleles to smaller values, but got the same error.

I would appreciate some insight into why that is.

Thanks, Yang

gokalpcelik commented 2 months ago

Hi @LYOKOIIIYYR You seem to have set your heap size to the full memory of the machine, which we do not recommend. GenotypeGVCFs does not need that much memory, as far as I recall. Can you set the heap size to a more moderate value, such as 8 GB or 12 GB, and try again?

droazen commented 2 months ago

Yes, it's important to realize that GenomicsDB is implemented in native C++ (not Java), so the memory available to GenomicsDB is whatever is NOT allocated to Java (i.e., whatever is left over after -Xmx). So -Xmx should never claim all of the memory on the machine, and should leave enough free memory for GenomicsDB to use.
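To make that split concrete, here is a minimal shell sketch of budgeting the heap on a 16 GB node (the 2c16g GenotypeGVCFs configuration above). The 50/50 split is an assumption for illustration, not an official GATK recommendation:

```shell
# Hypothetical sizing helper: divide the machine's RAM between the JVM heap
# (-Xmx) and the headroom GenomicsDB's native C++ code allocates outside it.
TOTAL_MB=16384                        # assumed: a 16 GB node
HEAP_MB=$(( TOTAL_MB / 2 ))           # JVM heap; never the whole machine
NATIVE_MB=$(( TOTAL_MB - HEAP_MB ))   # left over for GenomicsDB's native side
echo "use: --java-options \"-Xmx${HEAP_MB}m\" (native headroom: ${NATIVE_MB} MB)"
```

With this split, -Xmx8192m leaves about 8 GB of the node free for GenomicsDB's native allocations, instead of the heap claiming nearly everything.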

LYOKOIIIYYR commented 2 months ago

GenomicsDBImport itself runs without problems, and @gokalpcelik I have already tried -Xmx10G through -Xmx14G and get the same error. What I'm most curious about is why GenotypeGVCFs on a directly imported 400-sample GenomicsDB succeeds on the same computational resources, while GenotypeGVCFs on an incrementally built (200 + 200 samples) GenomicsDB fails.