dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0
142 stars · 37 forks

No space on device error #225

Closed: jimhavrilla closed this issue 4 years ago

jimhavrilla commented 4 years ago

So, when I try to merge 49,953 gVCFs into a pVCF for just chromosome 22 using the following command (passed into the docker image via bash -c):

glnexus_cli --config weCall_unfiltered --bed in/chroms/chr22.bed in/fixedgvcfs/*.gz > in/ukbchr22.bcf

I get the following error after about 28k samples have loaded:

[2020-06-06 20:14:31.449] [GLnexus] [info] Loaded 28399 datasets with 28399 samples; 12751105984064 bytes in 123428107856 BCF records (74432681 duplicate) in 1065073421 buckets. Bucket max 213584 bytes, 1865 records. 0 BCF records skipped due to caller-specific exceptions
[7] [2020-06-06 21:45:14.525] [GLnexus] [error] Failed to bulk load into DB: IOError: RocksDB kIOError (IO error: No space left on deviceWhile appending to file: GLnexus.DB/000995.sst: No space left on device)

Now, I understand this is a large file, but I should have ~10 TB free on the device, and all of the full gVCFs plus their tabix indexes only add up to 3.4 TB. So why is it taking up 12751105984064 bytes (12.7 TB!) just for chromosome 22? I'm hoping I'm just being dumb and the command is wrong, but this seems pretty crazy to me.

Thanks in advance,

Jim

mlin commented 4 years ago

The 12751105984064 bytes in the log message is an uncompressed size, so it's not really indicative of what will hit the disk compressed. What's going on in that stage is a big external sort of all the gVCF records, which can temporarily ~double the disk space usage compared to the final sorted version. Also, we reduce the compression level for these temporary external-sorting files so that they aren't so costly in CPU usage. It's at least conceivable that it would briefly spike to ~10 TB given 3.4 TB of input, but this is hand-wavy, I admit.
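One way to confirm whether that temporary spike is what exhausts the disk is to watch free space on the working filesystem while the load runs; a minimal sketch, assuming /path/to/workdir is replaced with wherever GLnexus.DB is actually being created:

watch -n 300 'df -h /path/to/workdir; du -sh /path/to/workdir/GLnexus.DB'   # refresh every 5 minutes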

Another thing to look into is whether, given the docker container configuration, the GLnexus.DB directory is being written to the intended storage device. This can get rather confusing, but in short, if the working directory isn't explicitly mounted, it's being written in the Docker "overlay" filesystem, which might be on a different device (and/or might have some kind of additional space-usage overhead; I'm not sure).

jimhavrilla commented 4 years ago

We may just need to free up more space on our disk drives, unfortunately.

Yes, I have mounted the drives where the files are stored (and are to be stored) as instructed on your "Getting Started" wiki page. I think it's good advice to check where the data is being stored temporarily. Based on the log information I'm not sure where GLnexus is putting it, though; I assumed it was streaming to the bcf file in the volume I mounted.

mlin commented 4 years ago

It might still be the case that it's writing the external sorting files (temporary files that are neither the input gVCFs nor the output bcf) in an unexpected location. They're written into the GLnexus.DB/ subdirectory of the docker container's working directory where you invoke glnexus_cli. Unless you cd into a mounted directory, I suspect they will end up getting written to a (convoluted) location under /var/lib/docker, though that path is config-dependent. This is probably the case if you don't see a GLnexus.DB directory left over somewhere on the host filesystem after the container exits. It's something to look into if /var/lib/docker is on some other filesystem without as much space available.
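A quick way to check from the host (a sketch; exact locations depend on your Docker configuration) is to ask Docker for its data root, see how much space that filesystem has, and look for a leftover GLnexus.DB directory underneath it:

docker info --format '{{.DockerRootDir}}'                              # usually /var/lib/docker
df -h "$(docker info --format '{{.DockerRootDir}}')"                   # free space on that filesystem
sudo find "$(docker info --format '{{.DockerRootDir}}')" -maxdepth 6 -type d -name GLnexus.DB 2>/dev/null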

jimhavrilla commented 4 years ago

Thanks again for the quick response, Mike. So is the advice to (A) change DOCKER_TMP so GLnexus writes somewhere other than /var/lib/docker/tmp, or (B) use docker -v to mount a volume to /GLnexus.DB?

Which would be better if I want the temporary GLnexus.DB/somenumber.sst files stored somewhere with plenty of space?

Sorry to keep bugging you.

Best,

Jim

mlin commented 4 years ago

I haven't tested this thoroughly, but I think you could try something like

docker run -v /spacious:/work -v /gvcf/dir:/work/in --workdir /work quay.io/mlin/glnexus:v1.2.6 glnexus_cli ...

where /spacious is on the filesystem with plenty of free space. Then, when the CLI creates the GLnexus.DB subdirectory for the external sorting files (= the RocksDB database) under /work, it will actually go onto the host's /spacious.
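For instance, a concrete (untested) sketch combining this with the original chr22 command, where /spacious and /path/to/gvcf_data are placeholder host paths and the directory mounted at /work/in is assumed to contain the chroms/ and fixedgvcfs/ subdirectories used earlier:

docker run --rm -v /spacious:/work -v /path/to/gvcf_data:/work/in --workdir /work \
  quay.io/mlin/glnexus:v1.2.6 \
  bash -c 'glnexus_cli --config weCall_unfiltered --bed in/chroms/chr22.bed in/fixedgvcfs/*.gz > ukbchr22.bcf'

The output ukbchr22.bcf and the temporary GLnexus.DB/ both land under /work, i.e. on the host's /spacious.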

jimhavrilla commented 4 years ago

Oh OK, so explicitly setting the workdir will fix where the temp files get stored. Gotcha. Thanks again for your help; I'll give this a shot.

Jim Havrilla

jimhavrilla commented 4 years ago

So I took the advice and tried to create the merged chr21 bcf; after about a day of running I hit this error:

[350] [2020-06-12 13:28:24.015] [GLnexus] [info] 30100/49953 (4021208_23176_0_0)...
[7] [2020-06-12 13:44:46.844] [GLnexus] [info] Loaded 30183 datasets with 30183 samples; 13548774610016 bytes in 131149622294 BCF records (79109993 duplicate) in 1131980311 buckets. Bucket max 213584 bytes, 1865 records. 0 BCF records skipped due to caller-specific exceptions
[7] [2020-06-12 14:17:34.514] [GLnexus] [error] Failed to bulk load into DB: Failure: corruption

Any idea what's going on there? Another space error, or a mounting error? It feels like I should just run this in the cloud.

mlin commented 4 years ago

RocksDB stores and checks CRC32 checksums of all the files it writes out to storage (actually, for shorter blocks therein). The "corruption" error happens when this checksum verification fails while reading back previously stored data. It's possible that some actual storage corruption occurred, but it's also not unheard of for some other problem (such as disk space, or a temporary outage/maintenance of network storage) to manifest this way due to incomplete error-code propagation in the I/O stack.

Sorry, it's hard to make specific suggestions for that error other than to try again in case it was genuinely random storage corruption...
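If you want to rule out an underlying storage problem first, a couple of generic host-side checks (illustrative only; the device name is a placeholder, and smartctl comes from the smartmontools package):

dmesg -T | grep -iE 'i/o error|blk_update_request|nfs' | tail -n 50    # recent kernel-level I/O errors
sudo smartctl -H /dev/sdX                                              # overall SMART health of the disk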

jimhavrilla commented 4 years ago

Well, we have been having mounting issues, so it's possible some of the workdir was affected. It's weird because I wasn't running on that drive, but if your suspicion is the same, maybe rerunning will produce better results.

I think it would be more expeditious to run several parallel merges of about 100 samples at a time and then merge those... though maybe GLnexus is already optimized for that.

Jim Havrilla

mlin commented 4 years ago

The hierarchical merging is sort of what's happening inside RocksDB with its LSM tree (https://en.wikipedia.org/wiki/Log-structured_merge-tree). For sporadic errors like this, it would be helpful to have the ability to resume the bulk load using the partial results of a failed attempt. That would not be as complicated as one might imagine, given that it's already loading a 'database' that supports atomic commit; but it takes a lot of diligence to test such functionality to the extent one would want in order to really have confidence using it.

jimhavrilla commented 4 years ago

If it tells me the GLnexus.DB already exists, is there a way I can pass the preexisting one back in?

mlin commented 4 years ago

Not as-is, but the driver code (https://github.com/dnanexus-rnd/GLnexus/blob/ca0e9b774c8006faa5c3d7659f89653bbbc92d6b/cli/glnexus_cli.cc#L63-L89) could be hacked/refactored to reuse it without a lot of deeper digging into the guts. The problem I mentioned is that it would be irresponsible for us to recommend anyone do that without some testing scheme to validate that data consistency is in fact recovered from the "corruption" error. The last thing anyone wants is for it to claim success while subtly screwing up the data in some way.

jimhavrilla commented 4 years ago

I see. So for now, best to delete it and start over, huh? Thanks for trying to figure this out with me; I appreciate it. I'm running it again, and if it doesn't work I'll try running in parallel on sets of 100 files and let you know how it goes.

Jim Havrilla

jimhavrilla commented 4 years ago

Hey Mike,

We had a server issue, but it was running extremely well this time. Is there any way I can make it faster if I run it on a more powerful machine? Does it scale up automatically? Are there arguments I can pass? On this machine it took 2 days to almost finish chromosome X for 50k samples.

It was nice meeting you on Twitter btw.

Best,

Jim

mlin commented 4 years ago

Glad to hear it -- there are a bunch of perf guidelines here: https://github.com/dnanexus-rnd/GLnexus/wiki/Performance

It scales (yes, automatically) pretty well up to at least 64 hardware threads.
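A hedged sketch of pinning the resources explicitly rather than relying on autodetection (check glnexus_cli --help for the options your build actually exposes; the chrX paths, thread count, and memory figure are just illustrative):

glnexus_cli --config weCall_unfiltered --bed in/chroms/chrX.bed \
  --threads 32 --mem-gbytes 256 \
  in/fixedgvcfs/*.gz > ukbchrX.bcf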

jimhavrilla commented 4 years ago

Great. We have 32 cores on this machine, along with several GPUs and 1 TB of RAM. One thing that's a bit unclear to me from the wiki: does it also use all the memory by default, or should I tell it that it's OK to use a ton of memory (I can use up to 1 TB if that makes it any faster)?

mlin commented 4 years ago

Yes, it detects the system memory and tries to use most of it by default. It definitely doesn't need 1 TB, but it's hard to say whether that's "too much." More memory should speed up the external sort by requiring fewer merge passes (while it says "Compacting database..."). But an excessively large heap can generally slow down the memory allocator, as it has to keep track of it all. Make sure to use jemalloc in any case; htslib's VCF code is pretty allocation-heavy, and jemalloc deals with that + high thread counts much better than glibc.
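For a build run outside docker, one common way to pick up jemalloc is LD_PRELOAD; a sketch assuming Debian/Ubuntu, where the package name and library path vary by distro:

sudo apt-get install -y libjemalloc2
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  glnexus_cli --config weCall_unfiltered --bed in/chroms/chr22.bed in/fixedgvcfs/*.gz > ukbchr22.bcf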

jimhavrilla commented 4 years ago

From the documentation, though: if I am using your docker image (v1.2.6-2-gca0e9b7 from quay.io), then jemalloc is in there by default, yes? Then I should be good to go. Thanks.

jimhavrilla commented 4 years ago

As an update, I have now been able to successfully make bcf files this way, so I will close this issue. Thanks for your help, Mike.

mlin commented 4 years ago

:tada: We've had a few success reports now with the 50K freeze, which has been a relief as it wasn't obvious whether the open-source version was going to handle it. Looking forward to 200K in the fall............

jimhavrilla commented 4 years ago

Yeah, that's terrifying. I hope my paper is done by then; I really don't want to have to restart this process for 200k. It's very slow on the cluster, about 3 days per chromosome, and not as fast as on the cloud. It's hard to parallelize when we don't have many huge machines with Docker/Singularity.

Jim Havrilla
