funnell closed this issue 9 years ago
Tyler; Sorry about the problems. I'm not sure why the merge is failing, but the warnings and multiple PG lines are normal. They're just the command lines from each of the individual chunks and, as you posted, are also present in files that work.
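As an illustrative check, assuming samtools is available (the file name is a placeholder):

# List the @PG records in the merged BAM header. Each aligned chunk
# contributes its own bwa and samblaster @PG entries, so seeing several
# of each after a merge is expected.
samtools view -H merged.bam | grep '^@PG'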
We'll probably need more details to diagnose the underlying issue. How many chunks was the alignment split into and fed into the merge? My best guess, without more information, is that the worker node doing the merge runs out of memory and is killed by the scheduler or the machine, so bcbio doesn't know that it died. If you could look at the scheduler logs for the machine where the merge happened, they might provide more details.
Thanks for helping to debug and hope this is helpful.
Hi Brad, thank you very much for the help. After digging further I think I've found the problem. Here is a gist of the output: https://gist.github.com/funnell/da8c3d9b22d3c6468c8f It looks like you were right about running out of memory. Do you have any rules of thumb for memory allocation?
Tyler; Glad that helps explain the issue. I haven't had this fail so I'm not totally sure. cc'ing in @gt1 to see if he has an idea. German, we're running bammerge piped to bamsormadup and running out of memory. Are there any estimates for memory usage per size/number of files, and ways to create more temporary files to reduce usage? This is the code in bcbio:
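As a rough sketch of the general shape of that step only, not the actual bcbio command (file names and option values below are placeholders):

# Illustrative only; every file name and option value is a placeholder.
bammerge I=chunk1.bam I=chunk2.bam I=chunk3.bam level=0 \
  | bamsormadup threads=16 SO=coordinate tmpfile=/scratch/tmp_prefix \
  > sample-sort.bam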
Tyler, how much memory does the machine that it failed on have? How many chunks are you merging? Sorry about the problem but hopefully we can improve it by digging into it.
The amount of memory used by bamsormadup will increase with every thread used. As a rough guideline I would assume it needs about 1GB of RAM per thread.
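As a worked illustration of that guideline (the headroom figure below is an assumption, not part of the reply above):

# Derive a conservative thread cap from the node's memory, using the rough
# 1GB-per-thread guideline and leaving headroom for the rest of the pipeline.
MEM_GB=48
HEADROOM_GB=16
THREADS=$(( MEM_GB - HEADROOM_GB ))              # roughly one thread per remaining GB
echo "run bamsormadup with threads=${THREADS}"   # 32 threads in this example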
OK, bumping up the available memory did the trick. It took about 30G. Thanks!
@gt1 -- we've been running into this memory issue with bammerge piped to bamsormadup on larger file sizes and would love thoughts on setting cores/usage to reduce memory usage. We're trying to run this on a large input: 27 input files of ~25Gb each, for a total output BAM size of just under 700Gb. It's run on a machine with 48Gb of memory and gets killed by the operating system due to memory usage. Is it possible to tweak intermediate file sizes or other things to reduce total memory usage for these larger inputs? We've tried reducing the total number of input files for the merge and reducing the cores used, without much success. Any tips for tuning/setting this would be very welcome. Thanks so much.
I would first try replacing bammerge with bamcat. If this does not help, then I would need more information about how exactly it fails.
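A hedged sketch of that substitution (file names and option values are placeholders; positional file names are assumed here, so check bamcat -h for the exact input syntax):

# Concatenate the aligned chunks with bamcat instead of bammerge, then
# sort and mark duplicates with bamsormadup.
bamcat chunk1.bam chunk2.bam chunk3.bam \
  | bamsormadup threads=16 SO=coordinate tmpfile=/scratch/tmp_prefix \
  > sample-sort.bam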
German -- thanks for the tip about bamcat. We swapped over to that and have been testing with it. It helped on some cases that were previously having problems but we're still stuck on some of the difficult large cases. Here's a traceback from one of the dying processes:
https://gist.github.com/chapmanb/c2adc796f63b65a2bca670491f66da69
It looks like it's getting killed by the operating system due to memory usage. In this case we have bamcat piped to bamsormadup with 15 threads on a machine with 48Gb of memory. I'd be happy for any ideas about how best to tweak our approach. Thanks again for the help.
Hi Brad,
could you please try running just bamcat >/dev/null on the files and let me know whether it finishes without getting aborted? If it finishes, please let me know what the memory usage is. The latest version (I have just uploaded binaries) prints this after a successful run.
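A minimal sketch of that check (file names are placeholders):

# Concatenate the inputs and discard the output, just to see whether bamcat
# alone completes and what memory usage it reports at the end.
bamcat chunk*.bam > /dev/null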
Thanks, German
German; Thanks for the help debugging and the new version to test with. This does finish successfully and reports:
[V] finished MemUsage(size=54.3125,rss=4.39844,peak=54.5039)
Assuming this is in Mb, it doesn't look like bamcat has a major memory component. If that seems right to you, I guess the memory issues are coming from bamsormadup. Any ideas to debug or tweak would be welcome. Thanks again.
Brad,
would you be able to upload the complete log of a failed run? Just the parts bamsormadup prints on the standard error channel?
German; Thanks again for the help with this and sorry for the delay in getting the log prepared. Here is a full log of a failed run:
To get a slightly cleaner exit than the system killing the process, I ran it on a larger-memory node and set SLURM limits to 48Gb -- the memory of the standard machines where it appears to die. It looks from the log like de-duplication finishes cleanly, but memory usage then increases when the sort process starts.
I'm happy for any tips on decreasing or controlling memory usage here, and we can write more temp files as a trade-off for being able to constrain memory. Thanks again.
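For reference, a hedged sketch of constraining a job to 48Gb under SLURM as described above (the script name and option values are placeholders, not the actual submission used here):

#!/bin/bash
# Cap the job at 48Gb so SLURM enforces the limit and terminates the job
# cleanly, rather than the node's OOM killer doing it.
#SBATCH --mem=48G
#SBATCH --cpus-per-task=16
bash run_merge.sh   # placeholder for the actual merge/sort command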
Brad,
I am not sure which part of the program is using all this memory. I have uploaded a version to diagnose this at https://github.com/gt1/biobambam2/releases/download/2.0.43-release-20160510091354/biobambam2-2.0.43-release-20160510091354-x86_64-etch-linux-gnu-debug.tar.gz . Could you run this while setting the LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM parameter to something slightly below the amount allocated by the scheduler?
LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM=40g biobambam2-2.0.43-release-20160510091354-x86_64-etch-linux-gnu-debug/bin/bamsormadup < input.bam >/dev/null 2>log
This will be somewhat chatty in the end. You may need to check the log at some point as I am not 100 percent sure the program will terminate once it has printed the debug output.
Thanks, German
German; Thanks so much for all the help debugging this. Here's the log file that bamsormadup outputs when exceeding 45Gb of memory usage:
Whew, that's a lot; I hope it has some useful information in there. Please let me know if I can help with anything else at all, and thanks again.
Hi Brad,
thanks for the output. It gives me an idea of where to reduce memory. Could you please retry with the following version?
It reduces the memory used per sorted block if the input is large. Hopefully this will be sufficient to handle your input.
Best, German
German; Brilliant, thank you. This worked on the difficult test dataset I had, giving a 500Gb merged sorted and de-duplicated BAM:
[V] blocks merged in time 03:37:45:16038400
[V] run time 13:32:56:81138099 (48776.8 s) MemUsage(size=24427.2,rss=22614.6,peak=41344.2)
Thanks so much for the help with this. I have some even larger high-depth BAM files (1Tb), so I will give those a try with the new code as well. I appreciate all your work on biobambam; it's an incredibly helpful tool.
Great to hear it. Please let me know if you run into trouble with the larger data set. bamsormadup is currently not prepared to run multiple merge passes over the data, which would be required at some point if the input gets too big.
German; The larger inputs worked as well (300x WGS samples from http://www.nature.com/ncomms/2015/151209/ncomms10001/full/ncomms10001.html):
-rw-rw-r-- 1 bchapman collab 807G May 19 18:14 align/MB-normal/MB-normal-sort.bam
-rw-rw-r-- 1 bchapman collab 894G May 20 04:52 align/MB-tumor/MB-tumor-sort.bam
Brilliant work. Thanks so much for the help.
Great. Thanks for letting me know.
Bam merge step results in an empty BAM file. Perhaps relatedly, the bammerge command produces a BAM file with multiple PG lines in the header for samblaster and bwa:
etc.
And bamsormadup produces these errors:
etc.
The result that I see is bcbio apparently finishing the process with
and it will then sit forever doing nothing, with the output BAM empty.