dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Running out of space #242

Open leorippel opened 3 years ago

leorippel commented 3 years ago

Hi,

Does the GLnexus team have an estimate of how much disk space is required, relative to the size of the input .gvcf files?

Cheers.

mlin commented 3 years ago

The disk is basically used for externally sorting the input gVCF data, so it'll need roughly as much space as the inputs themselves. I'd be surprised if it used less than half or more than double the input size, depending on what gzip compression level was used to create the inputs.

The external sorting is an alternative to having to fit all of the inputs in memory, or trying to keep hundreds of thousands of files open at once, of course.
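
As a rough illustration of that rule of thumb, here is a minimal Python sketch that sums the on-disk size of the input gVCFs and applies the half-to-double range. The 0.5 and 2.0 factors are taken from the estimate above; the function names and the `*.g.vcf.gz` glob pattern are hypothetical choices for this example, not part of GLnexus.

```python
from pathlib import Path


def total_input_bytes(gvcf_dir, pattern="*.g.vcf.gz"):
    """Sum the on-disk sizes of the gVCF files found under gvcf_dir.

    The glob pattern is an assumption; adjust it to your file naming.
    """
    return sum(p.stat().st_size for p in Path(gvcf_dir).rglob(pattern))


def estimate_scratch_bytes(input_bytes, low=0.5, high=2.0):
    """Return the (low, high) scratch-space range for a given input size.

    low/high are the rough half-to-double bounds quoted in this thread,
    not values measured or guaranteed by GLnexus.
    """
    return input_bytes * low, input_bytes * high


def to_terabytes(n_bytes):
    """Convert a byte count to decimal terabytes (10**12 bytes)."""
    return n_bytes / 10**12


if __name__ == "__main__":
    # Hypothetical input directory; replace with the real location.
    size = total_input_bytes("gvcfs")
    low, high = estimate_scratch_bytes(size)
    print(f"inputs: {to_terabytes(size):.2f} TB")
    print(f"expected scratch: {to_terabytes(low):.2f} to {to_terabytes(high):.2f} TB")
```

For example, 15 TB of inputs would put the expected range at about 7.5 to 30 TB, so usage well beyond that is worth investigating.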

gjun commented 3 years ago

I am using glnexus_cli to jointly genotype ~2,000 WGS gVCFs generated by GATK. The total gVCF size is less than 15 TB, but the GLnexus.DB directory is currently at 47 TB and the run is still going. Is this behavior normal? I've included chr1-22, X, and Y in a single BED file.

A related question: is there any way to estimate the approximate time to completion from the LOG messages while the job is running?

mlin commented 3 years ago

Hmm, if you're able to share the GLnexus.DB/LOG file, it might shed some light on what's going on (it does not contain anything sensitive; it's just the log of the LSM tree operations, which determine the peak space usage).

gjun commented 3 years ago

LOG.gz

I attached the LOG file.