biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Cannot run MarkDup on a very large BAM file #394

Closed. liujiayi771 closed this issue 4 years ago.

liujiayi771 commented 5 years ago

I have a very large BAM file; its size is 129 GB. How should I set the markdup parameters to make it run fast? My server has 64 cores and 256 GB of memory. No error is reported, but the process seems to have only just started and is not actually running.
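For reference, a minimal sketch of an invocation for a file of this size, using only flags listed by `sambamba markdup --help` (the paths, thread count, and scratch directory are placeholders):

```sh
# -p prints progress so you can tell whether the run is advancing;
# --tmpdir points the temporary files at a disk with enough free space
sambamba markdup -t 32 -p --tmpdir=/scratch/tmp \
    input.sorted.bam output.markdup.bam
```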

ZunpengLiu commented 5 years ago

I also met the same issue: a segmentation fault. I'm trying some parameters.

liujiayi771 commented 5 years ago

> I also met the same issue: a segmentation fault. I'm trying some parameters.

I have not encountered a segmentation fault, but my program runs very slowly. htop shows the process using no CPU, so it does not seem to be running.

ZunpengLiu commented 5 years ago

Maybe you could split the large BAM into smaller ones and try again?
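One way to do that split, assuming the input is coordinate-sorted, is to index it and extract one region per output file with `sambamba view` (file and chromosome names are placeholders):

```sh
# index once, then write one BAM per chromosome
sambamba index -t 8 input.sorted.bam
for chrom in chr1 chr2 chr3; do    # extend the list to all chromosomes
    sambamba view -f bam -t 8 -o ${chrom}.bam input.sorted.bam ${chrom}
done
```

Note that read pairs whose mates map to different chromosomes would end up in different files, so duplicates spanning chromosomes could be missed.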

vidboda commented 5 years ago

I also experienced the same issue with 0.6.7 and 0.6.9: the program starts, prints "finding positions of the duplicate reads in the file...", and then seems to sleep; this can last for hours. It happened on a variety of BAMs, from hundreds of MB to 2-5 GB, but not always (it works for a batch of BAMs, then blocks on one). The hardware is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz. I reverted to 0.6.6, which seems to work for the moment.

ZunpengLiu commented 5 years ago

> I also experienced the same issue with 0.6.7 and 0.6.9 [...] I reverted to 0.6.6, which seems to work for the moment.

Cool, I will try that. Also, what parameters would you recommend for processing larger BAM files?

vidboda commented 5 years ago

In my experience the number of threads had no influence on the issue. According to the documentation, `--hash-table-size` can be increased if you have really high coverage. You could maybe also try the buffer sizes?
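As a hedged example, a run that raises both the hash table and the overflow list (both flags are listed by `sambamba markdup --help`; the specific values are guesses to be tuned per dataset):

```sh
# a larger hash table (> average coverage * insert size) and a larger
# overflow list give reads in deep regions more room to meet their pairs
sambamba markdup -t 16 \
    --hash-table-size=4194304 \
    --overflow-list-size=600000 \
    input.sorted.bam output.markdup.bam
```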

ZunpengLiu commented 5 years ago

Yes, I also noticed what you mentioned about the threads. Thanks for your kind reply!

liujiayi771 commented 5 years ago

> I also experienced the same issue with 0.6.7 and 0.6.9 [...] I reverted to 0.6.6, which seems to work for the moment.

My CPU is an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, and I tried 0.6.6. However, after the program has run for a while, it still stops: the process does not finish, but it uses no CPU at all.

vidboda commented 5 years ago

Do you use the pre-compiled binaries? I was wondering whether it would be worth trying to compile from source.

liujiayi771 commented 5 years ago

> Do you use the pre-compiled binaries? I was wondering whether it would be worth trying to compile from source.

Have you successfully compiled v0.6.6 from source? I have run into a lot of problems related to dmd; I am not very familiar with D, so I don't know how to get it to compile. I used `git checkout tags/v0.6.6` to get the v0.6.6 sources, then used `git checkout` to switch BioD, lz4, and htslib to the versions pinned by v0.6.6. After `git clone https://github.com/dlang/undeaD`, I ran `make`, which fails with: `sambamba/utils/common/readstorage.d(22): Error: module stdlib is in file 'std/c/stdlib.d' which cannot be read`
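For reference, a sketch of the build procedure described above (the submodule layout and dmd version are assumptions; the `std.c.*` modules were removed from newer Phobos releases, which is what undeaD re-packages, so an older D compiler that still ships them is likely needed):

```sh
# fetch the v0.6.6 sources and pin the bundled dependencies to that tag
git clone --recursive https://github.com/biod/sambamba.git
cd sambamba
git checkout tags/v0.6.6
git submodule update --init --recursive

# undeaD, cloned as described above
git clone https://github.com/dlang/undeaD

# build; assumes a dmd old enough to still provide std.c.stdlib
make
```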

vidboda commented 5 years ago

> Have you successfully compiled v0.6.6 from source?

No, sorry, I did not, so I won't be able to help with this error.

liujiayi771 commented 5 years ago

I added the parameter `--hash-table-size=4194304` and now it runs normally. But I don't know how to choose a suitable value for this parameter; can someone give me some advice? The documentation says: "size of hash table for finding read pairs (default is 262144 reads); will be rounded down to the nearest power of two; should be > (average coverage) * (insert size) for good performance". I don't know what my average coverage and insert size are.
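In case it helps, both numbers can be estimated with samtools (a sketch; the file name is a placeholder). Mean insert size appears in the `SN` section of `samtools stats`, and average coverage is the mean depth per position:

```sh
# mean insert size: look for the "insert size average" SN line
samtools stats input.sorted.bam | grep ^SN | grep 'insert size average'

# average coverage: mean depth over all positions
# (-a includes zero-coverage sites)
samtools depth -a input.sorted.bam | awk '{sum += $3} END {print sum/NR}'
```

The product of the two values then gives a lower bound for `--hash-table-size`, per the documentation quoted above.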

ZunpengLiu commented 5 years ago

> I added the parameter `--hash-table-size=4194304` and now it runs normally. [...]

How much time and memory does sambamba need to mark duplicates in a 100 GB BAM, and how many cores did you use? I noticed that no matter how many threads I set, sambamba only uses 2-3 cores.

ZunpengLiu commented 5 years ago

`finding positions of the duplicate reads in the file...`

This step is taking a very long time.

ZunpengLiu commented 5 years ago

```
finding positions of the duplicate reads in the file...
sorted 291724792 end pairs and 10283856 single ends (among them 362897 unmatched pairs)
collecting indices of duplicate reads... done in 68219 ms
found 102567687 duplicates
collected list of positions in 41 min 45 sec
removing duplicates... (core dumped)
```

It crashed while removing duplicates. :( The command was:

```sh
sambamba markdup --hash-table-size=4500000 -r -t 20 $srtbam ${insam%.*}.sort.rmdup.bam
```