biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Add markdup to wiki #57

Closed pjotrp closed 7 years ago

isthisthat commented 10 years ago

Could you please also document whether sambamba-markdup removes previous duplicate marks? For example, in Picard I used to run RevertSam before MarkDuplicates. Thanks a lot.

lomereiter commented 10 years ago

@isthisthat Yes, it does. The behavior is deliberately the same as in Picard: reads with the 'secondary alignment' flag are not touched at all, and for all other reads the previous duplicate flag is overwritten.
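If it helps the wiki, here's a quick way to sanity-check that behavior (a minimal sketch; in.bam and remarked.bam are placeholder names, and it assumes samtools is also available): count the reads carrying the duplicate flag (0x400) before and after re-running markdup.

```bash
# Count reads already carrying the duplicate flag (FLAG bit 0x400 = 1024).
samtools view -c -f 1024 in.bam

# Re-run duplicate marking directly; no RevertSam-style step is needed,
# since existing duplicate flags on non-secondary reads are overwritten.
sambamba markdup in.bam remarked.bam

# Count duplicate-flagged reads again to see the re-marked totals.
samtools view -c -f 1024 remarked.bam
```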

isthisthat commented 10 years ago

So after downsampling with sambamba view -s I only need to run sambamba markdup in order to re-mark duplicates in the downsampled BAM, correct? Thanks again for your help.

lomereiter commented 10 years ago

Yes, no additional steps are needed. You're right that more details should be mentioned in the docs; I didn't know what Picard does until I delved into its source code.
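For the record, the full downsample-and-re-mark workflow then looks roughly like this (a sketch with placeholder file names and an example 0.5 fraction; the exact -s syntax may differ between sambamba versions):

```bash
# Downsample to roughly half of the reads; -f bam keeps the output in BAM format.
sambamba view -f bam -s 0.5 -o downsampled.bam in.bam

# Re-mark duplicates in the downsampled BAM; previous duplicate flags are overwritten.
sambamba markdup -t 8 downsampled.bam downsampled.markdup.bam
```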

isthisthat commented 10 years ago

@lomereiter Thanks a lot! Perhaps it's worth advertising that marking duplicates with Picard takes ~13hrs (RevertSam.jar + MarkDuplicates.jar) and 1hr with sambamba (on a 30x bam) !!!! :)

lomereiter commented 10 years ago

I believe the RevertSam step can be safely omitted with MarkDuplicates, too. BTW, I took some ideas from the biobambam paper. They have a duplicate-marking tool as well, but I was too lazy to check how it performs; it should be pretty fast too: https://github.com/gt1/biobambam/
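For anyone comparing, the Picard-only route then boils down to a single MarkDuplicates call once the RevertSam step is dropped (a rough sketch using the old-style .jar invocation mentioned above; file names and the memory setting are placeholders):

```bash
# Old-style Picard invocation, matching the MarkDuplicates.jar naming used above;
# INPUT/OUTPUT paths and -Xmx are placeholders, METRICS_FILE is Picard's required
# duplication-metrics output.
java -Xmx4g -jar MarkDuplicates.jar \
    INPUT=in.bam \
    OUTPUT=marked.bam \
    METRICS_FILE=dup_metrics.txt
```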

isthisthat commented 10 years ago

Thanks for the info! One more question: I guess I need to re-index the BAM after markdup, correct? Our variant caller has been failing on the markdup'ed BAM and I'm guessing this is why.

lomereiter commented 10 years ago

Currently, yes. Later I will add an option to create the index while writing the BAM (https://github.com/lomereiter/sambamba/issues/54).
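So until that lands, the extra step is simply rebuilding the index on the markdup output (out.bam is a placeholder name):

```bash
# Rebuild the BAM index so downstream tools (e.g. the variant caller) see the
# duplicate-marked file correctly.
sambamba index out.bam
```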

isthisthat commented 10 years ago

Hello again, we've been running sambamba's markdup routinely and got some consistent failures in some of the runs (i.e. they always fail with the same message). The command line looks like:

```bash
sambamba_v0.4.3 markdup -t 8 --tmpdir /path/to/tmp in.bam out.bam
```

and the error message reads:

```
finding positions of the duplicate reads in the file...
Cannot open or create file '/path/to/tmp/sorted.508.bam'
```

Unfortunately I cannot point you to the input BAM because it is proprietary data, but I was wondering if you had any advice. We are running on an 8-core machine, requesting 27G of RAM for a 44G BAM file (which was downsampled using sambamba view -s). Any help would be greatly appreciated. Thanks a lot.

lomereiter commented 10 years ago

@isthisthat thanks for the report. I guess you're hitting the system limit on the number of simultaneously open files. Try increasing --hash-table-size and --overflow-list-size. By the way, what are the coverage and insert size?
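A quick way to check whether the per-process open-file limit is the culprit (the 4096 below is only an illustrative value):

```bash
# Show the current per-process limit on simultaneously open files, which is what
# the "Cannot open or create file" error during markdup can run into.
ulimit -n

# Optionally raise the soft limit for this shell (subject to the hard limit)
# before rerunning markdup with larger --hash-table-size / --overflow-list-size.
ulimit -n 4096
```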

isthisthat commented 10 years ago

@lomereiter thanks for the quick response! It's a 30x BAM (125bp paired-end) with a modal insert size of 375bp. The default --hash-table-size is 262,144. In my case 375 * 30 = 11,250, so I should be well within the limit, right? What values would you recommend in my case? Thanks again.

lomereiter commented 10 years ago

Hmm, the default values should be fine, then. Increasing --overflow-list-size should help, but it's weird that there are so many temporary files. Do paired reads have the same name?

isthisthat commented 10 years ago

They do. I'll try with --overflow-list-size 500000. I also noticed that this particular BAM has a very high number of duplicates (42%). Perhaps this plays a role? And another small remark: for the same downsampling fraction, sambamba gives me a 31.5x build whereas samtools gives me 30x. Are you perhaps filtering out certain reads during downsampling? Thanks a lot.

lomereiter commented 10 years ago

42% is indeed very strange for paired reads. Maybe there are some huge spikes in coverage?

Downsampling involves some randomness and relies on statistical properties of read-name hashes; I recently changed the hash function in sambamba to a better one in this respect, which may cause such a difference.

isthisthat commented 10 years ago

Yup, --overflow-list-size 500000 seems to do the trick. By the way, is there a new release coming soon? :smiley:
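For reference, the invocation that now succeeds is simply the earlier command with the larger overflow list (the paths are placeholders, as before):

```bash
# Same invocation as before, with a larger overflow list so markdup spills far
# fewer temporary sorted.*.bam files into --tmpdir.
sambamba_v0.4.3 markdup -t 8 --tmpdir /path/to/tmp \
    --overflow-list-size 500000 in.bam out.bam
```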

lomereiter commented 10 years ago

Nope, I have almost no spare time until February due to exams.

isthisthat commented 10 years ago

Hi Artem, I was wondering how markdup handles single-end reads. Does it mark reads with the same start position and CIGAR string? If not, can this be a feature request? :) Thank you.

lomereiter commented 10 years ago

Hi, no, it looks only at the starting position and strand, duplicating the behaviour of Picard. I'm well aware that this approach is too simplistic and throws away a lot of information. Comparing CIGARs indeed might do the job if reads aren't too noisy. In short, I'll put it on my todo list.
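To make the current criterion concrete for the wiki, here is a rough illustration (not sambamba's actual implementation) of grouping single-end reads by reference, start position and strand; it assumes samtools is installed, uses a placeholder in.bam, and ignores clipping, so positions are only approximate:

```bash
# Group reads by (reference, POS, strand) -- FLAG bit 0x10 marks the reverse
# strand -- and report groups with more than one read, i.e. candidate duplicates
# under the position+strand criterion described above. Secondary and
# supplementary alignments (FLAG 0x100 and 0x800 = 2304) are excluded.
samtools view -F 2304 in.bam \
  | awk '{ strand = (int($2 / 16) % 2) ? "-" : "+";
           key = $3 "\t" $4 "\t" strand;
           count[key]++ }
         END { for (k in count) if (count[k] > 1) print count[k], k }' \
  | sort -k1,1nr | head
```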

pjotrp commented 7 years ago

closed due to inactivity