dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

account for discovered_alleles data structures in memory budget #199

Closed: mlin closed this issue 4 years ago

mlin commented 4 years ago

When the allele discovery stage runs on large WGS/WES cohorts, the discovered_alleles data structures use a considerable amount of memory that is not currently accounted for in the --mem-gbytes budget. This can cause an OOM failure when trying to run everything in one shot (as opposed to splitting the work, e.g. by chromosome, into separate runs).

mlin commented 4 years ago

9185487 ameliorates this by running allele discovery while RocksKeyValue is still open in BULK_LOAD mode, during which it has a much smaller LRU block cache (1/4 of the memory budget instead of 3/4).
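
To make the two fractions concrete, here is a minimal sketch of the arithmetic. It assumes the budget is given in GiB. The function name `block_cache_bytes`, the `bulk_load` flag, and the 64 GiB example budget are hypothetical and are not taken from the GLnexus source; only the 1/4 and 3/4 fractions come from the comment above.

```python
GIB = 1 << 30  # bytes per GiB


def block_cache_bytes(mem_gbytes: int, bulk_load: bool) -> int:
    """Illustrative only: LRU block cache size as a fraction of the memory budget."""
    budget = mem_gbytes * GIB
    # 1/4 of the budget in BULK_LOAD mode, 3/4 otherwise (fractions from the comment above).
    if bulk_load:
        return budget // 4
    return budget * 3 // 4


if __name__ == "__main__":
    example_budget = 64  # hypothetical --mem-gbytes value, chosen for illustration
    for bulk_load in (True, False):
        cache_gib = block_cache_bytes(example_budget, bulk_load) / GIB
        print(f"bulk_load={bulk_load}: block cache = {cache_gib:.0f} GiB")
```

Under these assumptions, the difference between the two modes is half of the budget, which is the headroom left for the discovered_alleles structures while allele discovery runs.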

williamrowell commented 6 months ago

It looks like the memory budget still exceeds --mem-gbytes.

> glnexus_cli --threads 24 --mem-gbytes 96 --config DeepVariant_unfiltered --dir GLnexus.DB *.g.vcf.gz > out.deepvariant.glnexus.bcf
[15] [2024-03-11 18:04:21.304] [GLnexus] [info] glnexus_cli release v1.4.3-0-gcecf42e Sep 20 2021
[15] [2024-03-11 18:04:21.304] [GLnexus] [info] detected jemalloc 5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756
...
[15] [2024-03-11 18:07:36.693] [GLnexus] [info] genotyping 7355593 sites; sample set = *@3 mem_budget = 103079215104 threads = 24

In general, do you have any recommendations around how much memory to request over the amount specified in --mem-gbytes?
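
As a quick check on the numbers in the log above, the sketch below converts the reported mem_budget value from bytes into both binary (GiB) and decimal (GB) units. The helper names are hypothetical. The only inputs are the `--mem-gbytes 96` flag and the `mem_budget = 103079215104` value shown in the log. It says nothing about the process's actual memory use, only about the configured budget.

```python
GIB = 1 << 30        # bytes per GiB (binary)
GB = 1_000_000_000   # bytes per GB (decimal)

MEM_GBYTES_FLAG = 96            # from the command line in the log
MEM_BUDGET_BYTES = 103_079_215_104  # from the "genotyping ... mem_budget" log line


def bytes_to_gib(n: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n / GIB


def bytes_to_gb(n: int) -> float:
    """Convert bytes to decimal GB (10**9 bytes)."""
    return n / GB


if __name__ == "__main__":
    print(f"budget in GiB: {bytes_to_gib(MEM_BUDGET_BYTES)}")
    print(f"budget in GB:  {bytes_to_gb(MEM_BUDGET_BYTES):.2f}")
    print(f"matches flag in GiB: {MEM_BUDGET_BYTES == MEM_GBYTES_FLAG * GIB}")
```

The logged value is exactly 96 × 2^30 bytes, so the reported budget matches the flag when read in GiB; read as decimal gigabytes, the same figure is about 103.08 GB.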