biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Too many files for merge #440

Closed sainadfensi closed 4 years ago

sainadfensi commented 4 years ago

sambamba/0.7.0

Hi,

I've seen a similar issue raised for markdup, where you suggested increasing the limit on the number of open files or increasing the memory. However, I cannot raise the limit high enough by adjusting ulimit, and sambamba-merge has no option like markdup's for increasing the hash table size or list size.
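For reference, this is how I've been checking and adjusting the per-process open-file limit (a sketch; the actual values depend on the system, and a non-root user can only raise the soft limit up to the hard limit):

```shell
# Inspect the current limits on open file descriptors:
ulimit -Sn                  # soft limit: what processes actually get
ulimit -Hn                  # hard limit: the ceiling a non-root user may reach
# Raise the soft limit to the hard limit for this shell session:
ulimit -n "$(ulimit -Hn)"
```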

My first thought was that I could merge only part of the files at a time and then merge the merged files to eventually get one BAM file, but that would waste time and computation, right?

Can I have some suggestions from you?

Many Thanks, Jue

mschilli87 commented 4 years ago

@sainadfensi

My first thought was that I could merge only part of the files at a time and then merge the merged files to eventually get one BAM file, but that would waste time and computation, right?

What makes you think that? Did you test it with a subset of files small enough for a single merge? Computationally, I don't think there would be a big penalty, though I didn't check the actual implementation of sambamba merge. The main performance hit might be I/O, since you have to write the intermediate results to disk and read them back. If all your files fit into RAM, you could store the partial merge results on a ramdisk/tmpfs. Otherwise, a single merge would have to write some sort of intermediate results to disk anyway, so the performance penalty of the stepwise merge would be smaller to begin with. But maybe I'm just not seeing something. It's still quite early here. :wink:
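A stepwise merge could be scripted roughly like this (a sketch only: the batch size, file names, and the `DRY_RUN` switch are placeholders I made up, not anything from sambamba itself; each batch just needs to stay well under your `ulimit -n`):

```shell
#!/bin/sh
# Two-stage merge sketch: merge inputs in small batches, then merge the
# partial results. DRY_RUN="echo" only prints the commands; empty it to run.
DRY_RUN="echo"
BATCH_SIZE=3                          # keep well below `ulimit -n` in practice
set -- a.bam b.bam c.bam d.bam e.bam  # replace with your input BAMs

i=0; n=0; batch=""; partials=""
for f in "$@"; do
    batch="$batch $f"
    n=$((n + 1))
    if [ "$n" -eq "$BATCH_SIZE" ]; then
        i=$((i + 1))
        $DRY_RUN sambamba merge "partial_$i.bam" $batch
        partials="$partials partial_$i.bam"
        batch=""; n=0
    fi
done
# Merge any leftover files that didn't fill a full batch:
if [ -n "$batch" ]; then
    i=$((i + 1))
    $DRY_RUN sambamba merge "partial_$i.bam" $batch
    partials="$partials partial_$i.bam"
fi
# Final pass: merge the partial results into one BAM:
$DRY_RUN sambamba merge merged.bam $partials
```

If RAM allows, pointing the `partial_*.bam` paths at a tmpfs mount would avoid most of the extra disk I/O mentioned above.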

sainadfensi commented 4 years ago

Thank you. It's good to know that it's feasible to merge the files in separate steps. I had the wrong idea because I thought the tool would search for each read's mate across all of the inputs. I'll close this issue.