bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Bam merge failing #971

Closed funnell closed 9 years ago

funnell commented 9 years ago

The bam merge step results in an empty bam file. Perhaps relatedly, the bammerge command produces a bam file with multiple @PG lines in the header for samblaster and bwa:

@PG     ID:bwa_0        PN:bwa  CL:/home/tfunnell/local/software/bcbio_nextgen/bin/bwa mem -c 250 -M -t 8 -R @RG\tID:9\tPL:illumina\tPU:9_2015-07-24_genome_instability_bwamem\tSM:SA699_A
@PG     ID:SAMBLASTER_0 CL:samblaster -i stdin -o stdout -M -d /dev/fd/62 -s /dev/fd/63 --maxSplitCount 2 --maxUnmappedBases 50 --minIndelSize 50 --minNonOverlap 20    VN:0.1.22
@PG     ID:bwa_1        PN:bwa  CL:/home/tfunnell/local/software/bcbio_nextgen/bin/bwa mem -c 250 -M -t 8 -R @RG\tID:9\tPL:illumina\tPU:9_2015-07-24_genome_instability_bwamem\tSM:SA699_A
@PG     ID:SAMBLASTER_1 CL:samblaster -i stdin -o stdout -M -d /dev/fd/62 -s /dev/fd/63 --maxSplitCount 2 --maxUnmappedBases 50 --minIndelSize 50 --minNonOverlap 20    PP:SAMBLASTER_0 VN
@PG     ID:bwa_2        PN:bwa  CL:/home/tfunnell/local/software/bcbio_nextgen/bin/bwa mem -c 250 -M -t 8 -R @RG\tID:9\tPL:illumina\tPU:9_2015-07-24_genome_instability_bwamem\tSM:SA699_A
@PG     ID:SAMBLASTER_2 CL:samblaster -i stdin -o stdout -M -d /dev/fd/62 -s /dev/fd/63 --maxSplitCount 2 --maxUnmappedBases 50 --minIndelSize 50 --minNonOverlap 20    PP:SAMBLASTER_1 VN

etc.

And bamsormadup produces these errors:

WARNING: SAM header designates more than one PG tree root by PP tags.
WARNING: PG line with id SAMBLASTER_0 has multiple children referencing it by PP tags.
WARNING: PG line with id SAMBLASTER_1 has multiple children referencing it by PP tags.

etc.

The result I see is that bcbio apparently finishes the process with

flushing read ends lists...done.

and then sits there forever doing nothing, with the output bam empty.
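
For anyone trying to reproduce this, a quick way to confirm the empty output and look at the duplicated @PG records (the file name is a placeholder; assumes samtools is on the PATH):

# count alignments in the merged output -- returns 0 here, hence "empty"
samtools view -c merged-sort.bam
# list the @PG header lines shown above
samtools view -H merged-sort.bam | grep '^@PG'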

chapmanb commented 9 years ago

Tyler; Sorry about the problems. I'm not sure why the merge is failing, but the warnings and multiple PG lines are normal. They're just the command lines from each of the individual alignment chunks and appear, exactly as you posted, in files that work correctly.

We'll probably need more details to be able to diagnose the underlying issue. How many chunks was the alignment split into and fed into the merge? My best guess, without more information, is that the worker node doing the merge runs out of memory and is killed by the scheduler or the machine itself, so bcbio doesn't know that it died. If you could look at the scheduler logs for the machine where the merge happened, they might provide more details.
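
For example, something along these lines on the node (or via the scheduler's accounting) will usually show an out-of-memory kill; the exact commands depend on your setup and the job id is a placeholder:

# kernel OOM-killer messages on the node where the merge ran
dmesg | grep -i -E 'out of memory|killed process'
# on SLURM, job accounting reports peak memory and the exit state
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed,State,ExitCode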

Thanks for helping to debug and hope this is helpful.

funnell commented 9 years ago

Hi Brad, thank you very much for the help. After digging further I think I've found the problem. It looks like you were right about running out of memory. Do you have any rules of thumb for memory allocation? Here is a gist of the output:

https://gist.github.com/funnell/da8c3d9b22d3c6468c8f

chapmanb commented 9 years ago

Tyler; Glad that helps explain the issue. I haven't had this fail myself so I'm not totally sure. cc'ing in @gt1 to see if he has an idea. German, we're running bammerge piped into bamsormadup and running out of memory. Are there any estimates for memory usage per size/number of input files, and ways to create more temporary files to reduce usage? This is the code in bcbio:

https://github.com/chapmanb/bcbio-nextgen/blob/aedbbb42d0ddd7c66fda63719c172455a23358fe/bcbio/pipeline/merge.py#L92

Tyler, how much memory does the machine that it failed on have? How many chunks are you merging? Sorry about the problem but hopefully we can improve it by digging into it.

gt1 commented 9 years ago

The amount of memory used by bamsormadup will increase with every thread used. As a rough guideline I would assume it needs about 1GB of RAM per thread.
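
As a rough sketch of what that means in practice (illustrative only; the options follow the biobambam2 key=value style and should be checked against the bamsormadup usage output):

# with ~1GB per thread, 8 threads needs roughly 8GB plus some headroom from the scheduler
bamsormadup threads=8 SO=coordinate tmpfile=$(pwd)/sorttmp < merged.bam > sorted-dedup.bam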

funnell commented 9 years ago

OK, bumping up the available memory did the trick. It took about 30G. Thanks!

chapmanb commented 8 years ago

@gt1 -- we've been running into this memory issue with bammerge piped to bamsormadup on larger file sizes and would love thoughts on setting cores/usage to reduce memory usage. We're trying to run this on a large input: 27 input files of ~25Gb each, for a total output BAM size of just under 700Gb. It's run on a machine with 48Gb of memory and gets killed by the operating system due to memory usage. Is it possible to tweak intermediate file sizes or other settings to reduce total memory usage for these larger inputs? We've tried reducing the total number of input files for the merge and reducing the cores used, without much success. Any tips for tuning/setting this would be very welcome. Thanks so much.

gt1 commented 8 years ago

I would first try replacing bammerge with bamcat. If this does not help, then I would need more information about how exactly it fails.
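
Roughly, that means swapping the first stage of the pipe to something like this (a sketch only, not the exact bcbio command; the repeated I= inputs and the bamsormadup options are assumptions to check against the tools' usage output):

# concatenate the aligned chunks instead of merging them, then sort and mark duplicates
bamcat I=chunk1.bam I=chunk2.bam I=chunk3.bam | bamsormadup threads=8 SO=coordinate tmpfile=$(pwd)/sorttmp > merged-sort.bam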

chapmanb commented 8 years ago

German -- thanks for the tip about bamcat. We swapped over to that and have been testing with it. It helped on some cases that were previously having problems but we're still stuck on some of the difficult large cases. Here's a traceback from one of the dying processes:

https://gist.github.com/chapmanb/c2adc796f63b65a2bca670491f66da69

It looks like it's getting killed by the operating system due to memory usage. In this case we have bamcat piped to bamsormadup with 15 threads on a machine with 48Gb of memory. I'd be happy for any ideas about how best to tweak our approach. Thanks again for the help.

gt1 commented 8 years ago

Hi Brad,

could you please try running just bamcat >/dev/null on the files and let me know whether this finishes without getting aborted? If it finishes, please let me know what the memory usage is. The latest version (I have just uploaded binaries) prints this after a successful run.
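
That is, just the concatenation stage on its own with the output thrown away, something like this (input names are placeholders; the repeated I= option is an assumption to check against bamcat's usage output):

bamcat I=chunk1.bam I=chunk2.bam I=chunk3.bam > /dev/null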

Thanks German

chapmanb commented 8 years ago

German; Thanks for the help debugging and the new version to test with. This does finish successfully and reports:

        [V] finished MemUsage(size=54.3125,rss=4.39844,peak=54.5039)

Assuming this is in Mb, it doesn't look like bamcat has a major memory footprint. If that seems right to you, I guess the memory issues are coming from bamsormadup. Any ideas to debug or tweak would be welcome. Thanks again.

gt1 commented 8 years ago

Brad,

would you be able to upload the complete log of a failed run? Just the parts bamsormadup prints on the standard error channel?

chapmanb commented 8 years ago

German; Thanks again for the help with this and sorry for the delay in getting the log prepared. Here is a full log of a failed run:

https://gist.githubusercontent.com/chapmanb/c2adc796f63b65a2bca670491f66da69/raw/8ecba43908d4eaa13e491f366d27954f7910629c/bamsormadup_full.log

To get a slightly cleaner exit than the system killing the process, I ran on a larger memory node and set SLURM limits to 48Gb -- the memory of the standard machines where it appears to die. It looks from the log like de-duplication finishes cleanly but then memory usage increases when starting the sort process.
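
For reference, the cap was applied through the scheduler rather than through the tool itself, along the lines of the following (illustrative; the job script name and core count are placeholders):

# request a hard 48Gb limit so the scheduler, rather than the kernel OOM killer, stops the job
sbatch --mem=48G --cpus-per-task=16 run_merge.sh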

Happy for any tips on decreasing or controlling memory usage here and can write more temp files as a tradeoff of having the ability to constrain memory. Thanks again.

gt1 commented 8 years ago

Brad,

I am not sure which part of the program is using all this memory. I have uploaded a version to diagnose this at https://github.com/gt1/biobambam2/releases/download/2.0.43-release-20160510091354/biobambam2-2.0.43-release-20160510091354-x86_64-etch-linux-gnu-debug.tar.gz . Could you run this while setting the LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM parameter to something slightly below the amount allocated by the scheduler?

LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM=40g biobambam2-2.0.43-release-20160510091354-x86_64-etch-linux-gnu-debug/bin/bamsormadup  < input.bam >/dev/null 2>log

This will be somewhat chatty in the end. You may need to check the log at some point as I am not 100 percent sure the program will terminate once it has printed the debug output.
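
If it helps, wrapping the run in GNU time also records the peak resident set size independently of the program's own reporting (time writes its report to standard error, so it ends up in the same log):

LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM=40g /usr/bin/time -v biobambam2-2.0.43-release-20160510091354-x86_64-etch-linux-gnu-debug/bin/bamsormadup < input.bam >/dev/null 2>log
# look for the "Maximum resident set size" line in the log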

Thanks German

chapmanb commented 8 years ago

German; Thanks so much for all the help debugging this. Here's the log file that bamsormadup outputs when exceeding 45Gb of memory usage:

https://gist.githubusercontent.com/chapmanb/c2adc796f63b65a2bca670491f66da69/raw/c0f1352f57fc478bbb54ab4e46dbb84260b91643/bamsormadup_LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM.log

Whew, that's a lot, hope it has some useful information in there. Please let me know if I can help with anything else at all and thanks again.

gt1 commented 8 years ago

Hi Brad,

thanks for the output. It gives me an idea of where to reduce memory. Could you please retry with the following version?

https://github.com/gt1/biobambam2/releases/download/2.0.44-release-20160517104101/biobambam2-2.0.44-release-20160517104101-x86_64-etch-linux-gnu.tar.gz

It reduces the memory used per sorted block if the input is large. Hopefully this will be sufficient to handle your input.

Best, German


chapmanb commented 8 years ago

German; Brilliant, thank you. This worked on the difficult test dataset I had, giving a 500Gb merged, sorted and de-duplicated BAM:

[V] blocks merged in time 03:37:45:16038400
[V] run time 13:32:56:81138099 (48776.8 s)      MemUsage(size=24427.2,rss=22614.6,peak=41344.2)

Thanks so much for the help with this. I have some even larger high-depth BAM files (1Tb), so I'll give those a try with the new code as well. I appreciate all your work on biobambam; it's an incredibly helpful tool.

gt1 commented 8 years ago

Great to hear it. Please let me know if you run into trouble with the larger data set. bamsormadup is currently not prepared to run multiple merge passes over the data, which would be required at some point if the input gets too big.

chapmanb commented 8 years ago

German; The larger inputs worked as well (300x WGS samples from http://www.nature.com/ncomms/2015/151209/ncomms10001/full/ncomms10001.html):

-rw-rw-r-- 1 bchapman collab  807G May 19 18:14 align/MB-normal/MB-normal-sort.bam
-rw-rw-r-- 1 bchapman collab  894G May 20 04:52 align/MB-tumor/MB-tumor-sort.bam

Brilliant work. Thanks so much for the help.

gt1 commented 8 years ago

Great. Thanks for letting me know.