biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Merging already merged BAM files #92

Closed brettChapman closed 7 years ago

brettChapman commented 9 years ago

Hi, I've been trying to merge over 100 BAM files.

First off, I aligned FASTQ files (both paired and singleton) to the genome using STAR and generated my BAM files. The data came from a variety of experiments from public repositories and collaborators. The genome I was working with is a very large hexaploid plant genome, so I needed to align to each sub-genome, identify matching contigs, and then index and align to a new target genome. This was because STAR could not index the entire genome, and alignment time would likely have exceeded the walltime.

The machine I've been running Sambamba on takes a very long time to merge that many BAM files and hits the walltime.

What I've done is break the problem down by merging BAM files from the different experiments into a number of batches.

From around 100 BAM files, I now have about 11 BAM files.

I tried to merge these 11 BAM files and I now get this error: sambamba-merge: graph contains cycles

Is there a way around this? I tried indexing the BAM files with sambamba and then merging, but the problem remains. I'm about to try sorting the BAM files with sambamba and then merging them again; I had already sorted them by coordinate with samtools before the first merge.

Any help with this would be much appreciated.

Thanks.

Regards

Brett

lomereiter commented 9 years ago

Hi,

Please send me the headers of all these files. Use something along the lines of:

    for bam in *.bam; do sambamba view -H "$bam" > "$bam".header; done
    tar czf headers.tar.gz *.header

brettChapman commented 9 years ago

Ok thanks. Do you want the headers of all 100 files, or just the 11 merged BAM files, which I'm having trouble merging?

lomereiter commented 9 years ago

Preferably all of them, in order to see how these 11 resulted.

brettChapman commented 9 years ago

I've got sambamba view running on 149 BAM files on our local scratch server. For the 11 BAM files, I'm running it on the supercomputer I've been attempting to merge them on; they take up a lot of space, and there isn't enough room on our local server. Running on the supercomputer means I'll have to wait in the queue. I should have the 149 headers ready later today or tomorrow, but the 11, since they're queued, may take longer. How should I get these header files to you?

lomereiter commented 9 years ago

OK, you can send just the 149 headers. If possible, also specify how they were packed into batches so that I can reproduce the problem. Upload them to Dropbox/Google Drive/etc., or use one of the file-sharing services (e.g. http://wikisend.com/).

brettChapman commented 9 years ago

The 149 headers are ready: a 766 MB tar.gz file.

brettChapman commented 9 years ago

Will send shortly

brettChapman commented 9 years ago

We have our own sftp site. I'll upload them there and give you the username and password. I'm working from home, so if I were to use Dropbox etc. I'd have to download before uploading, which could take a while.

brettChapman commented 9 years ago

I'll delete the comment after you've downloaded it. I'd rather not keep the username and password posted online.

brettChapman commented 9 years ago

Here is essentially what I used to merge them in batches. BAM files with similar names were merged into the same batch.

So files like:

iwgscWheatGenome_SRR831586_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_SRR831587_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_SRR831588_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_SRR831589_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_SRR831590_SE_align_pass2Aligned.out.sam.sorted.bam

are one batch.

The script used for running on each batch:

    #!/bin/bash

    SUBMIT_FILE1="Submit_MergeBam_batch1.sh"
    mergeString=""
    inputString=" "

    # build the space-separated list of input BAM files
    for i in *.sam.sorted.bam; do
        mergeString=$mergeString$inputString$i
    done

    # write the SLURM submission script
    echo -e "#!/bin/bash -l" > $SUBMIT_FILE1
    echo -e ". /etc/profile.d/modules.sh" >> $SUBMIT_FILE1
    echo -e "cd \$SLURM_SUBMIT_DIR" >> $SUBMIT_FILE1
    echo -e "module use /ivec/$IVEC_OS/modulefiles/bio-apps" >> $SUBMIT_FILE1
    echo -e "module load sambamba" >> $SUBMIT_FILE1
    echo -e "sambamba merge -t 64 -l 5 -p Wheat_RNAseq.sorted.bam$mergeString" >> $SUBMIT_FILE1

    # submit the job
    job1=`sbatch -c 64 --account=director840 --ntasks=1 --mem=1536G --time=24:00:00 $SUBMIT_FILE1`
    echo $job1

lomereiter commented 9 years ago

Thanks, you can delete the comment with the password, I'm now downloading the file.

brettChapman commented 9 years ago

Files like:

iwgscWheatGenome_leaf_Z10_rep2_PE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z10_rep2_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z23_rep1_PE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z23_rep1_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z23_rep2_PE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z23_rep2_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z71_rep1_PE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z71_rep1_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z71_rep2_PE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaf_Z71_rep2_SE_align_pass2Aligned.out.sam.sorted.bam

So all from leaf are one batch, etc.

But, files like:

iwgscWheatGenome_grain_5_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_grain_6_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_grain_7_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_grain_8_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaves_5_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaves_6_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaves_7_SE_align_pass2Aligned.out.sam.sorted.bam
iwgscWheatGenome_leaves_8_SE_align_pass2Aligned.out.sam.sorted.bam

I kept in one batch; being SE they were quite small, so I could easily merge them within the walltime.

brettChapman commented 9 years ago

ok thanks

brettChapman commented 9 years ago

Those 11 BAM files haven't started running yet. Let me know if you need them, and I'll make them available when they're ready. I'm going to head off now. I'll be back on later. Thanks for the help with this.

brettChapman commented 9 years ago

The 11 BAM headers are finished. I'll upload to the same place. Just access it like last time, changing the file name to 11_headers.tar.gz. Thanks.

lomereiter commented 9 years ago

You can try https://github.com/lomereiter/sambamba/releases/download/latest/sambamba_latest_linux.tar.bz2 It uses indexes in complicated cases like this.

_BUT_ the amount of computational work is likely to be the same as if you merged the 149 files directly. That's because whether you merge 11 files or 149, the same number of records is read and written. If you want to cut down the running time, you should reduce the compression level.

For example, recompressing a BAM file on my 2-core laptop:

    compression level   real time, s   user time, s   file size, MB
    0                    8.1            8.4           960
    1                   13.6           23.2           333
    2                   14.3           25.3           323
    3                   17.8           31.8           313
    4                   19.0           34.1           298
    5                   26.0           47.3           288
    6 (default)         37.6           68.4           283

As you can see, compression is the most significant factor here. Levels 7-9 also exist, but they are rarely worth using.
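
In practice, for your case, that would mean something along these lines (just a sketch reusing the file names from your batch script; -l 1 trades a somewhat larger output file for much less CPU time, and the thread count is only an example):

    sambamba merge -t 16 -l 1 -p Wheat_RNAseq.sorted.bam *.sam.sorted.bam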

brettChapman commented 9 years ago

Ok. Thanks for the help with this. I'll let you know how it goes.

brettChapman commented 9 years ago

I've tried running with all 149 BAM files. I'm 20 hours into a 24-hour walltime now and have seen no progress; even the progress bar hasn't appeared yet. For this run I used compression level 2. I'm now going to try with just 2 BAM files, and also with the 11 already-merged BAM files, to see if that makes any difference. If I were to index each BAM file before merging, would that make any difference? I'll also try compression level 0 if nothing else works; otherwise I may have to find somewhere to run this without a walltime, to see whether it simply isn't running for long enough, regardless of compression level.
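
If I do go down the indexing route first, I was thinking of something like this (assuming sambamba index takes a -t thread option like merge does):

    for bam in *.sam.sorted.bam; do
        sambamba index -t 8 "$bam"
    done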

pjotrp commented 9 years ago

Hi Brett,

It appears to me that you need to narrow down the actual problem; there are too many factors. I would start with fewer files and make them shorter (slice them), then merge. Walltime may certainly be an issue; the error log/exit value of the process should be indicative. You could use qlogin if you have that available.

brettChapman commented 9 years ago

Ok thanks. I'll look into using slice. The genome is highly fragmented, so cutting up the contigs further is unlikely to help. Is there a way to slice out an entire contig without having to specify the regions inside it? Otherwise I'll need to determine the contig lengths each time I slice up a BAM file. I did start with small numbers of files from the same experiment; the problem arises when I try to merge BAM files from different experiments. So if I were to extract (slice) each contig, then merging the same contigs together across different experiments might resolve the issue.

Unfortunately there is no qlogin.

lomereiter commented 9 years ago

I haven't implemented the progress bar for this scenario yet, so just look at the size of the output file and compare it with the total size of the input files; that should give an idea of how much walltime is required.
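
For example, something like this (plain coreutils, using the output name from your batch script) gives a rough percentage:

    du -sh Wheat_RNAseq.sorted.bam         # size of the growing output
    du -ch *.sam.sorted.bam | tail -n 1    # total size of the inputs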

brettChapman commented 9 years ago

I just saw an example elsewhere where only the chromosome is specified to slice out, with no regions. I'm going to use the BAM headers to get a list of contigs, then slice out each contig from each of the 11 already-merged BAM files. I'll then merge all experiments contig by contig to generate a BAM file for each contig. If that works, then merging the per-contig BAM files should be straightforward.
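
Roughly what I have in mind for one merged file, as a sketch (it assumes sambamba slice accepts a bare contig name as the region, like the example I saw, and that the file has been indexed first):

    bam=Wheat_RNAseq.sorted.bam
    sambamba index "$bam"
    # list contig names from the @SQ header lines, then slice each one out
    sambamba view -H "$bam" | awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }' |
    while read -r contig; do
        sambamba slice -o "${contig}.bam" "$bam" "$contig"
    done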

brettChapman commented 9 years ago

Ok. I did specify -p and didn't receive an error. No BAM file has been output yet after 20 hours. Once it finishes, I'll run again without -p, just in case.

brettChapman commented 9 years ago

Even though I've got no BAM file output, it's doing something: 100% CPU usage with 1.1% memory usage (total memory on the machine is 6 TB).

134180 bchapman 20 0 62.5g 60g 1120 R 100 1.1 1435:12 sambamba_latest

brettChapman commented 9 years ago

The run has now finished. No BAM output:

    bchapman@zythos:/scratch/director840/bchapman/Wheat/Wheat_BAM_files> cat slurm-12717.out
    slurmd[zythos]: * JOB 12717 CANCELLED AT 2014-08-31T16:32:49 DUE TO TIME LIMIT *

pjotrp commented 9 years ago

You have hit your walltime, it appears. Are you on the submit machine? There may be limitations there. Also, your cluster may need to allow you to use multiple cores. Those are PBS issues and may have qsub options.

Right now, I just merged 80 BAM files on 16 cores. Sambamba is doing its job.

brettChapman commented 9 years ago

Yes, the walltime was hit because no BAM file was generated to completion in time. I'm running on a cluster with shared resources; the machine can be seen here: http://www.ivec.org/systems/zythos/. It uses SLURM, not PBS Pro. I've run with 64 threads and over 1 TB of RAM, as can be seen in the submission script I pasted here last week. I've run with these parameters before on around 10-40 BAM files from a single experiment with no problems; that is how I generated the 11 already-merged BAM files.

The problem is therefore likely in how sambamba tries to merge BAM files from the different experiments. Possibly the number of records is so large that the processing time before any merging begins is longer than the walltime, but I would have expected at least a partially generated merged BAM file. I will try a few different options, then try slicing, and if all else fails I'll try a different machine, if I can find one with enough disk space to process them all. Zythos has 3 PB of storage, so storage has not been a problem there.

pjotrp commented 9 years ago

Using

    sbatch -c 64 --account=director840 --ntasks=1 --mem=1536G --time=24:00:00

I don't use SLURM, but my guess is that you get the 24-hour walltime divided by 64, which is rather short :). Try using -c 8 first (sambamba effectively uses about 8 cores fully, because IO tends to lag behind). Otherwise, increase the time limit. I am not sure about the implications of --ntasks=1; maybe that number should be higher.
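
For example, something along these lines (only a sketch: the 48-hour limit is arbitrary, and the -t value in the generated merge command should be lowered to match -c):

    sbatch -c 8 --ntasks=1 --account=director840 --mem=1536G --time=48:00:00 Submit_MergeBam_batch1.sh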

pjotrp commented 9 years ago

Yes, --ntasks should be 1 for one node. And yes, sambamba starts off using one core. If the walltime runs out, there will be no output.

lomereiter commented 9 years ago

@pjotrp, your note about merging 80 BAM files is irrelevant here. The trouble is that the headers are HUGE: each of these 149 files has around 1 million "reference sequences", and the code was never optimized for such a huge number of them. In particular, the GC totally chokes with that many pointers floating around. I had no choice but to simply disable it for the initialization phase and re-enable it later. With this in place, the program was able to complete header merging in 12 minutes, using over 80 GB of RAM.

lomereiter commented 9 years ago

@Brett-CCG I've uploaded new executables, try merging these 149 files again (with -p option).

brettChapman commented 9 years ago

Thanks. Should I download the latest release like last time from here? https://github.com/lomereiter/sambamba/releases/download/latest/sambamba_latest_linux.tar.bz2

brettChapman commented 9 years ago

I downloaded the latest version. Good so far: I'm seeing a BAM file being generated, 2.5 GB so far, 30 minutes into a 24-hour walltime. It will likely finish in time. I'll let you know how it goes.

pjotrp commented 9 years ago

@lomereiter awesome! I think we can allow disabling GC with a switch anyway. May make sambamba even faster here and there, at the expense of RAM use.

brettChapman commented 9 years ago

The run has around 4 hours left. I decided to run with compression level 5, as a smaller BAM file will be more workable for the later analysis I'm doing, due to space limits on the machine I'll be transferring it to. If it hits the walltime before it completes, would adding more RAM to the job improve the speed, now that GC is disabled? Adding more RAM wouldn't be a problem on Zythos, as it is a specialty machine built for such large, memory-intensive tasks.

lomereiter commented 9 years ago

What is the total size of input files, and what is the size of the incomplete BAM file?

brettChapman commented 9 years ago

The original BAM files total 262 GB and the incomplete BAM file is 182 GB. The progress bar looked about 60-65% of the way through before it hit the walltime.

brettChapman commented 9 years ago

I ran with 64 threads and 1536 GB RAM. I can add a bit more RAM; the machine has a total of 6 TB.

lomereiter commented 9 years ago

The figures you give are not at all normal: a write speed of 182 GB / 24 h ≈ 2 MB/s is very low. In the latest commit I tried to get rid of GC allocations as much as possible; it now doesn't leak as much with the GC disabled (in my tests, about 15 MB per 1 GB of output). I've updated the binaries, please try again.

brettChapman commented 9 years ago

Thanks. That last run caused some problems, with the job not coming off the queue after it hit the walltime. I assume that's because a lot of data was still in memory, possibly due to sambamba leaking memory. I'll let you know how the next run goes once I'm given the go-ahead, after they've investigated the cause of the problem.

brettChapman commented 9 years ago

I've just been told by the sysadmin of Zythos not to run the job again! The problem also locked me out of my directory, and I can't access it again until maintenance in 2 weeks when they reboot the machine. Would I be able to run my job on a machine with less RAM, say 80 GB or so? This is a huge bottleneck for me. I won't be able to move forward, and it's an important component of my thesis to include this data.

brettChapman commented 9 years ago

I'm going to consult my supervisor and see if something can be arranged, even if it's to run it on an AWS cluster. I'll let you know how it goes. Thanks for your help with this.

brettChapman commented 9 years ago

What would be the minimum requirement for sambamba, given that GC is disabled, to process 262 GB of BAM files? I need to come up with a specification so we can start up an AWS cluster. Hopefully we'll be able to run it within around 24 hours. Could there be a trade-off between RAM, the number of threads, and run time? The price point on AWS for larger-RAM machines may be higher than for smaller ones with longer run times.

lomereiter commented 9 years ago

For the latest version, 122 GB of RAM should be enough. I'd go with an r3.4xlarge, storing the output on an EBS volume.

pjotrp commented 9 years ago

I've just been told by the sys admin of Zythos not to run the job again! The problem also locked out my directory and I can't access it again until maintenance in 2 week when they reboot the machine.

I would not accept that, and would complain to the administration. If Zythos cannot handle the load, it should not allow users to usurp such resources. It is not your fault, nor sambamba's. You are probably the first user to really stress the system; that deserves a medal, and they should fix your home dir or give you a new one :)

brettChapman commented 9 years ago

I agree. I'll have to discuss it with my supervisor.

I've been told that 122 GB is quite a lot of RAM, which may limit our options.

How many threads should we run sambamba with on the AWS cluster? I assume the more threads, the quicker the job completes, so we could be a bit flexible if it's more cost-effective to run for longer.

lomereiter commented 9 years ago

To be honest, I would start by asking why merge these files at all. I don't see why further analysis can't just operate on multiple BAM files.

Lots of data require either lots of RAM or lots of programmer's time, and for one-off tasks scientists usually choose the former. The reason is simple: an hour of r3.4xlarge on EC2 costs $1.40, and if the task takes less than a day to complete, spending 2 hours coding a better solution can hardly be justified.

brettChapman commented 9 years ago

That's true, but the analysis I'm doing requires a single BAM file built from multiple large sources of RNA-seq data. It extracts sequences which span a splice region; any splice region that appears more than twice in the BAM file is considered for analysis. The spliced region is then interpreted as a peptide sequence used in proteogenomics analysis. The tool I'm using is the one developed in this paper: http://pubs.acs.org/doi/abs/10.1021/pr400294c

pjotrp commented 9 years ago

The number of threads depends a bit on IO speed. From my measurements, merge plateaus at about 10 threads unless you have very fast IO (and that appears rather doubtful here). Take 16 threads and you should outperform all systems. Maybe the problem goes away too, because there will be less resource contention.

lomereiter commented 9 years ago

@Brett-CCG did you try running the tool on one of the smaller BAM files? It's a good idea to do that, in order to be prepared for possible later issues. I downloaded its source code from http://proteomics.ucsd.edu/software-tools/ (second link from the bottom) and noticed that it cares about neither the order of records nor read groups. So what you should start with is simply converting all the BAM files into SAM and concatenating them.
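
Roughly like this, as a sketch (it assumes all your BAM files share the same header, i.e. the same @SQ lines in the same order, since they were aligned against the same target; take the header from any one of them):

    # single header, then headerless SAM records from every file
    sambamba view -H iwgscWheatGenome_SRR831586_SE_align_pass2Aligned.out.sam.sorted.bam > combined.sam
    for bam in *.sam.sorted.bam; do
        sambamba view "$bam" >> combined.sam
    done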