pjotrp closed this issue 7 years ago
@isthisthat yes, it does. The behavior is deliberately the same as in Picard: reads with the 'secondary alignment' flag are not touched at all, and for all other reads the previous duplicate flag is rewritten.
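The rule above can be sketched in a few lines of Python. This is my paraphrase of the described behavior, not sambamba's actual code; the flag constants are the standard SAM FLAG bits.

```python
SECONDARY = 0x100  # SAM flag: secondary alignment
DUPLICATE = 0x400  # SAM flag: PCR or optical duplicate

def rewrite_dup_flag(flag, is_dup):
    # Secondary alignments are left untouched; for every other read
    # the duplicate bit is cleared and then re-set according to the
    # new duplicate decision.
    if flag & SECONDARY:
        return flag
    flag &= ~DUPLICATE
    if is_dup:
        flag |= DUPLICATE
    return flag
```

Because the bit is cleared unconditionally for non-secondary reads, stale duplicate marks from a previous run do not survive.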
So after downsampling with sambamba view -s, I only need to run sambamba markdup in order to re-mark duplicates in the downsampled bam, correct? Thanks again for your help.
Yes, no additional steps are needed. You're right that more details should be mentioned in docs, I didn't know what Picard does until I delved into its source code.
@lomereiter Thanks a lot! Perhaps it's worth advertising that marking duplicates with Picard takes ~13hrs (RevertSam.jar + MarkDuplicates.jar) and 1hr with sambamba (on a 30x bam) !!!! :)
I believe the RevertSam step can be safely omitted with MarkDuplicates, too.
BTW I took some ideas from the biobambam paper. They have a duplicate marking tool, too, but I was too lazy to check how it performs; it should be pretty fast as well: https://github.com/gt1/biobambam/
Thanks for the info! One more question: I guess I need to re-index the bam after markdup correct? Our variant caller has been failing on the markdup'ed bam and I'm guessing this is why.
Currently, yes. Later I will add an option to create index while writing BAM (https://github.com/lomereiter/sambamba/issues/54).
Hello again, we've been running sambamba's markdup routinely and got some consistent failures in some of the runs (i.e. they always fail with the same message). The command line looks like:
sambamba_v0.4.3 markdup -t 8 --tmpdir /path/to/tmp in.bam out.bam
and the error message reads:
finding positions of the duplicate reads in the file...
Cannot open or create file '/path/to/tmp/sorted.508.bam'
Unfortunately I cannot point you to the input bam because it is proprietary data but I was wondering if you had any advice. We are running on an 8-core machine, requesting 27G of RAM for a 44G bam file (which was downsampled using sambamba view -s) Any help would be greatly appreciated. Thanks a lot.
@isthisthat thanks for the report, I guess you're hitting the system limit on the number of simultaneously open files. Try increasing --hash-table-size and --overflow-list-size.
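The error message mentions sorted.508.bam, i.e. over 500 temporary files were open at once. A back-of-envelope model (my assumption, not taken from sambamba's source) of why a larger --overflow-list-size helps: if each full overflow list is flushed to its own sorted temporary file, the file count shrinks as the list size grows.

```python
import math

def estimated_tmp_files(overflow_entries, overflow_list_size):
    # Hypothetical model: one sorted temp file per flushed overflow
    # list, so the file count is inversely proportional to the
    # --overflow-list-size setting.
    return math.ceil(overflow_entries / overflow_list_size)
```

With several hundred temp files plus the input and output BAMs, a typical per-process open-file limit (often 1024 on Linux, checkable with `ulimit -n`) can plausibly be exceeded.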
By the way, what are the coverage and insert size?
@lomereiter thanks for the quick response! It's a 30x bam (125bp paired-end) with a mode insert size of 375bp. The default --hash-table-size is 262,144. In my case 375 * 30 = 11,250, so I should be well within the limit, right?
What values would you recommend in my case?
Thanks again.
Hmm, the default values should be fine, then. Increasing --overflow-list-size should help, but it's weird that there are so many temporary files. Do paired reads have the same name?
They do. I'll try with --overflow-list-size 500000.
I also noticed that this particular bam has a very high number of duplicates (42%). Perhaps this plays a role?
And another small remark: for the same downsampling fraction, sambamba gives me a 31.5x build whereas samtools gives me a 30x. Are you perhaps filtering out certain reads during the downsampling?
Thanks a lot.
42% is indeed very strange for paired reads. Maybe there are some huge spikes in coverage?
Downsampling involves some randomness and relies on statistical properties of read-name hashes; I recently changed sambamba's hash function to one that behaves better in this respect, which may cause such a difference.
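The name-hash approach can be illustrated with a small sketch. The hash below (MD5) and the seed handling are stand-ins for illustration, not sambamba's actual hash function; the key idea is that both mates share a read name, so they get the same keep/drop decision and pairing survives downsampling.

```python
import hashlib

def keep_pair(read_name, fraction, seed="42"):
    # Map the read name to a pseudo-uniform value in [0, 1) and keep
    # the pair if it falls below the requested fraction. Both mates
    # have the same name, hence the same decision.
    h = hashlib.md5((seed + read_name).encode()).digest()
    value = int.from_bytes(h[:4], "big") / 2**32
    return value < fraction
```

If the hash mixes poorly, the fraction of names falling below the threshold deviates from the requested fraction, which could explain a 31.5x result where 30x was requested.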
Yup, --overflow-list-size 500000 seems to do the trick.
By the way, is there a new release coming soon? :smiley:
Nope, I have almost no spare time until February due to exams.
Hi Artem, I was wondering how markdup handles single-end reads. Does it mark reads with the same start position and CIGAR string? If not, can this be a feature request :) Thank you.
Hi, no, it looks only at the starting position and strand, duplicating Picard's behaviour. I'm well aware that this approach is too simplistic and throws away a lot of information. Comparing CIGARs might indeed do the job if the reads aren't too noisy. In short, I'll put it on my todo list.
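The position-plus-strand criterion described above can be sketched as follows. The `(name, ref, pos, is_reverse)` record shape is made up for illustration, not a real sambamba API, and for simplicity this keeps the first read of each group, whereas real tools typically keep the read with the best base qualities.

```python
from collections import defaultdict

def single_end_duplicates(reads):
    # Group single-end reads by (reference, start position, strand);
    # every read after the first in a group is marked as a duplicate.
    groups = defaultdict(list)
    for name, ref, pos, is_reverse in reads:
        groups[(ref, pos, is_reverse)].append(name)
    dups = set()
    for names in groups.values():
        dups.update(names[1:])  # keep the first read of each group
    return dups
```

Adding the CIGAR string to the grouping key would distinguish reads that start at the same position but align differently, which is the feature request discussed here.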
closed due to inactivity
Could you please also document whether sambamba-markdup removes previous duplicate marks? For example, with Picard I used to run RevertSam before MarkDuplicates. Thanks a lot.