dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

out of memory for large dataset #206

Closed: hurleyLi closed this issue 4 years ago

hurleyLi commented 4 years ago

Hi, I'm trying to run GLnexus on a large dataset of 9,000 exome gVCFs, and I always run out of memory. I'm running on a 190 GB machine, and I specified the exon BED file.

Any suggestions on how I should deal with this issue? Thanks! Hurley

mlin commented 4 years ago

Hi @hurleyLi, try setting --mem-gbytes 100 or some other value that is conservative relative to the available system memory.

This is not supposed to be necessary, but we recently fixed an issue (#199) on the master branch (not yet in the released binary) which could cause memory usage to exceed the budget. If you have a moment to paste the last few messages in the console log from before the OOM happens, I can confirm whether it's consistent with that issue.
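A conservative invocation along these lines might look like the following sketch (the file names and the 60% margin are illustrative assumptions, not from this thread; xAtlas is the config mentioned later in the discussion):

```shell
# Sketch only: cap GLnexus well below physical RAM on a 190 GB machine.
# exons.bed and gvcfs.txt are illustrative file names, not from this thread.
MEM_GB=$(( 190 * 60 / 100 ))   # ~60% of RAM leaves headroom; 114 here
CMD="glnexus_cli --config xAtlas --bed exons.bed --mem-gbytes ${MEM_GB} --list gvcfs.txt"
echo "$CMD > merged.bcf"
```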

mlin commented 4 years ago

@hurleyLi the new v1.2.4 release fixes the aforementioned memory budget overrun; without the logs I can't be certain that's what you were hitting, but it's a consistent possibility. LMK how you are doing, thanks

mlin commented 4 years ago

@hurleyLi how are we doing on this issue?

hurleyLi commented 4 years ago

Hi @mlin, thanks for the new release. I haven't had a chance to test it, as our cluster doesn't run the latest CentOS version, which the new release seems to require. I've been using the old version while talking with our IT, hoping they can update CentOS and install the latest GLnexus. I'll keep you posted!

Best,

mlin commented 4 years ago

@hurleyLi thanks for the update -- keeping the 'static' executable compatible with older OSes is a bit of whack-a-mole as new glibc symbol dependencies creep in and have to be overridden. I can try to look into what the problem is there. Do you by any chance have the error message listing the specific symbols causing problems on the older CentOS?

hurleyLi commented 4 years ago

Sorry it's been a long time... I finally managed to upgrade to v1.2.6, but I still hit the OOM problem. Here are the last few lines from stderr:

[32519] [2020-09-03 00:45:30.955] [GLnexus] [info] Loaded 1032 datasets with 1032 samples; 1877991987712 bytes in 12738644516 BCF records (99180074 duplicate) in 99826811 buckets. Bucket max 902704 bytes, 5943 records. 155290 BCF records skipped due to caller-specific exceptions
[32519] [2020-09-03 00:45:31.022] [GLnexus] [info] Created sample set *@1032
[32519] [2020-09-03 00:45:31.022] [GLnexus] [info] Flushing database...
[32519] [2020-09-03 00:46:35.643] [GLnexus] [info] Bulk load complete!
[32519] [2020-09-03 00:46:35.654] [GLnexus] [warning] Processing full length of 86 contigs, as no --bed was provided. Providing a BED file with regions of interest, if applicable, can speed this up.
[32519] [2020-09-03 00:46:35.666] [GLnexus] [info] found sample set *@1032
[32519] [2020-09-03 00:46:35.666] [GLnexus] [info] discovering alleles in 86 range(s) on 6 threads
[32519] [2020-09-03 02:05:54.280] [GLnexus] [info] discovered 91748151 alleles
[32519] [2020-09-03 02:10:12.993] [GLnexus] [info] unified to 32804441 sites cleanly with 33230195 ALT alleles. 0 ALT alleles were additionally included in monoallelic sites and 12996380 were filtered out on quality thresholds.
[32519] [2020-09-03 02:10:12.994] [GLnexus] [info] Finishing database compaction...
[32519] [2020-09-03 02:10:18.376] [GLnexus] [info] genotyping 32804441 sites; sample set = *@1032 mem_budget = 118111600640 threads = 8

My command:

glnexus_cli --config xAtlas --list filelist -m 110 -t 8 > output.bcf

I reserved 120 GB of memory. Here's the error message from when my job was killed:

job 2745157 exceeded MEM usage hard limit (130189 > 122880) delasync job deleted
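As a unit sanity check (my own arithmetic, not output from either tool): the mem_budget printed in the log is exactly -m 110 interpreted as GiB, and the scheduler's 122880 limit is the 120 GB reservation expressed in MiB:

```shell
# Relating the numbers above: -m is in GiB, the scheduler limit is in MiB.
budget_bytes=$(( 110 * 1024 * 1024 * 1024 ))
limit_mib=$(( 120 * 1024 ))
echo "$budget_bytes"   # 118111600640 -- matches mem_budget in the log
echo "$limit_mib"      # 122880 -- matches the job scheduler's hard limit
```

By the same arithmetic, the observed peak of 130189 MiB (about 127 GiB) overshot the 110 GiB budget by roughly 17 GiB.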

I'm wondering: does -m specify total memory, or memory per thread?

Thanks for your time!

mlin commented 4 years ago

Hi @hurleyLi, I'll look into this a little more, but my only immediate suggestions are:

  1. Make sure jemalloc is loaded, as discussed on the Performance wiki page. glnexus_cli prints a warning within its first few log lines if it isn't detected. jemalloc is essential to keep memory fragmentation (and thus peak usage) under control with as much multithreaded activity as GLnexus generates.
  2. Reduce -m further, perhaps even by half to be safe. It is a process-wide setting, not per-thread. But the genotyping stage, where your log indicates the failure is happening, doesn't actually need tons of working memory; it just fills whatever you give it with an LRU cache. The lower memory budget will have some impact on the speed of the upstream loading/sorting stage, but nothing catastrophic.

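Both suggestions together might look like this sketch (the jemalloc library path varies by distro and is an assumption here, as is the halved -m value):

```shell
# Sketch: preload jemalloc if present, then run with a halved memory budget.
JEMALLOC=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2   # path is distro-specific
if [ -e "$JEMALLOC" ]; then
  export LD_PRELOAD="$JEMALLOC"
fi
# -m is process-wide, not per-thread, so halve the previous 110 rather than
# dividing it across the 8 threads:
CMD="glnexus_cli --config xAtlas --list filelist -m 55 -t 8"
echo "$CMD > output.bcf"
```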
hurleyLi commented 4 years ago

I did get a warning about jemalloc being absent... I'll try to ask our IT to install it. Anyway, I dropped the settings to -m 30 and -t 4, and it worked: it finished joint calling about 1,000 samples in a day. I guess I'll have to install jemalloc to improve performance (or maybe I should have set the memory budget to 80 GB). Thanks for your help!

yangyxt commented 1 year ago

Hi, I also ran into this kind of issue with the latest release (using Singularity).

These are the final log lines (attached as a screenshot, not reproduced here).

This is the command line I ran:

glnexus_cli --config DEEPVARIANTWES --dir --threads 15 ${vcfs[@]} >
The Singularity image version is 1.4.1.