broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Memory leak in GenotypeGVCFs with `-all-sites` #8989

Open brisk022 opened 1 week ago

brisk022 commented 1 week ago

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs with -all-sites

Affected version(s)

GATK 4.2 through 4.6 (4.1 is not affected)

Description

We tried to run GenotypeGVCFs from GATK 4.5 with -all-sites on a dataset of 120 samples with GRCh37 as the reference. Each run was limited to a single chromosome, and all of them failed after consuming 3 TB of memory. I then tried a smaller subset of 8 samples with memory limited to 32 GB, and all runs failed after processing 3-10 Mbp, depending on the chromosome.

Finally, I picked chromosome 9 at random and ran GATK versions 4.1 through 4.6; only 4.1 did not show the problem. It finished the whole chromosome (141 Mbp) with a maximum memory usage of around 8 GB. All the others failed after 3-6 Mbp. (Sorry, I used different memory settings for 4.5, so I did not include it.)

[Plot: memory_usage]

Time is in seconds, memory is in MB.

If I run the same command without -all-sites, the maximum memory usage is around 1.6 GB.

Steps to reproduce

The GenomicsDB workspace was created with the corresponding GATK version as follows:

gatk --java-options "-Xmx12000m" GenomicsDBImport --genomicsdb-workspace-path tmp/genomicsdb44/9 \
    --genomicsdb-shared-posixfs-optimizations --batch-size 120 --verbosity DEBUG \
    -L 9 -V data/gatk/gvcf/9/1.g.vcf.gz -V data/gatk/gvcf/9/2.vcf.gz -V data/gatk/gvcf/9/3.g.vcf.gz \
    -V data/gatk/gvcf/9/4.g.vcf.gz -V data/gatk/gvcf/9/5.g.vcf.gz -V data/gatk/gvcf/9/6.g.vcf.gz \
    -V data/gatk/gvcf/9/7.g.vcf.gz -V data/gatk/gvcf/9/8.g.vcf.gz

GenotypeGVCFs was run as:

gatk --java-options "-Xmx12g" GenotypeGVCFs -R data/ref/hs37d5.fa.gz \
    -V gendb://tmp/genomicsdb44/9 -O data/gatk/variants/9/raw44.vcf.gz -L 9 \
    --tmp-dir ./tmp/tmp -all-sites

All runs were performed under resource_monitor, which was instructed to kill the process if it consumed more than 14000 MB of memory. Thus, roughly 2 GB was left outside the Java heap for reading GenomicsDB. The on-disk size of the GenomicsDB workspace is around 3.1 GB for versions >= 4.2 and 3.0 GB for version 4.1.
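
As a rough check of the headroom figure, the arithmetic can be sketched as below. It assumes that `-Xmx12g` means 12 × 1024 MiB of Java heap and that the monitor limit of 14000 is expressed in the same units; the latter is an assumption, not something confirmed in this report. The variable names are illustrative only.

```shell
# Rough headroom check; the unit conventions are assumptions (see above).
heap_mb=$((12 * 1024))   # -Xmx12g, as used for the GenotypeGVCFs run
limit_mb=14000           # kill threshold given to resource_monitor
headroom_mb=$((limit_mb - heap_mb))
echo "headroom outside the Java heap: ${headroom_mb} MB"
```

Under these assumptions the headroom for the GenotypeGVCFs run is about 1.7 GB; with a 12000 MB heap (the `-Xmx12000m` setting used for GenomicsDBImport) it would be 2000 MB.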

gokalpcelik commented 1 week ago

Can you reduce the maximum number of alleles per site when you run this analysis?

brisk022 commented 1 week ago

Sure, below are the results when running with --max-alternate-alleles 5:

[Plot: memory_usage_ma5]