broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License

Is it possible to make Picard MarkDuplicates support multiple threads? #801

Closed · liugui closed this 7 years ago

liugui commented 7 years ago

I use Picard to mark duplicates and it works well. However, as the input BAM file gets bigger, Picard gets much slower. So I wonder: is it possible to make Picard MarkDuplicates support multiple threads? I read the source code and found that there are two steps to marking duplicates: sort and mark. If it's possible, where should I change the source code? Any reply would be much appreciated!

yfarjoun commented 7 years ago

Thanks for your interest!

The major slowness of MarkDuplicates is due to I/O and encoding/decoding of the bam file. The biggest gains from multithreading would come from a scheme in which a thread pool encodes the bam into in-memory buffers, which are then written to disk by a single thread using a buffered stream. This would have to be done in htsjdk, not Picard. If you feel up to the task, be my guest, but it will be arduous, and the review process will be slow due to the difficulty of testing multithreaded code.
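
As a rough illustration of that scheme, here is a minimal sketch using plain java.util.zip (not htsjdk's BAM writer; the chunking and file handling are illustrative assumptions):

```java
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.zip.Deflater;

// Workers deflate chunks in parallel; a single writer drains the futures
// in submission order, so the output stream stays ordered.
public class ParallelDeflateSketch {

    static byte[] deflate(byte[] chunk) {
        Deflater d = new Deflater();
        d.setInput(chunk);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!d.finished()) bos.write(buf, 0, d.deflate(buf));
        d.end();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> inOrder = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(args[1]))) {
            byte[] chunk = new byte[64 * 1024];            // arbitrary chunk size
            int n;
            while ((n = in.read(chunk)) > 0) {
                byte[] copy = Arrays.copyOf(chunk, n);
                inOrder.add(pool.submit(() -> deflate(copy)));  // CPU-bound work on the pool
            }
            for (Future<byte[]> f : inOrder) {
                out.write(f.get());                        // single ordered writer
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

A real htsjdk change would also need to bound the number of in-flight chunks so memory stays capped; this sketch keeps every compressed chunk alive until it is written.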

Cheers!

liugui commented 7 years ago

@yfarjoun Thanks for your reply! I read the source code of Picard over the last two days and found that there are two main steps:

  1. buildSortedReadEndLists: collect the paired-read information and sort it.
  2. Mark duplicate reads and write the output.
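
For orientation, a compressed sketch of that two-pass shape (hypothetical stand-in types, not Picard's actual ReadEnds/SortingCollection machinery):

```java
import java.util.*;

// Hypothetical stand-ins for the two passes; not Picard's actual classes.
class TwoPassSketch {
    record Rec(String name, int ref, int unclippedStart, int score) {}

    // Pass 1: for every position key, remember the best-scoring template;
    // every other template at that key is a duplicate.
    static Set<String> duplicateNames(List<Rec> bam) {
        Map<Long, Rec> best = new HashMap<>();
        for (Rec r : bam) {
            long key = ((long) r.ref() << 32) | (r.unclippedStart() & 0xFFFFFFFFL);
            best.merge(key, r, (a, b) -> a.score() >= b.score() ? a : b);
        }
        Set<String> dups = new HashSet<>();
        for (Rec r : bam) {
            long key = ((long) r.ref() << 32) | (r.unclippedStart() & 0xFFFFFFFFL);
            if (!best.get(key).name().equals(r.name())) dups.add(r.name());
        }
        return dups;
    }
    // Pass 2 (not shown): re-read the bam, set the duplicate flag (0x400)
    // on every record whose name is in the returned set, and write out.
}
```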

I found that these two steps take almost the same amount of time. So do you mean that the input and decoding in the first step, and the output and encoding in the second step, are the major sources of slowness?

Thanks very much.

yfarjoun commented 7 years ago

The second step requires input as well, doubly so: the SortingCollection (of ReadEnds) has likely been spilled to disk, and the bam itself needs to be read again. And since we need to update the PG record in the bam, we have to eagerly decode each record and then encode it (yet again!).

There is a slight saving available during the first pass: there we can avoid eager decoding, since we only need the read name and CIGAR, but not the tags...
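
That first-pass saving is essentially lazy decoding. A generic illustration of the idea (an assumed shape for illustration, not htsjdk's actual implementation):

```java
import java.util.function.Supplier;

// A field that is decoded only on first access: a pass that never touches
// the tags never pays for decoding them. Illustrative shape only.
class LazyField<T> {
    private final Supplier<T> decode;
    private T value;

    LazyField(Supplier<T> decode) { this.decode = decode; }

    T get() {
        if (value == null) value = decode.get();  // decode once, on demand
        return value;
    }
}
```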

liugui commented 7 years ago

@yfarjoun I profiled Picard with a CPU profiler, and here are the two most time-consuming functions:

  1. Java_java_util_zip_Deflater_deflateBytes: 705.293 s (libzip.so)
  2. Java_java_util_zip_Inflater_inflateBytes: 160.719 s (libzip.so)

Indeed, these are the I/O (compression and decompression) functions, so I think it will be difficult for me to improve performance by adding more threads. As an alternative, when I sort the records in a SAM file (not with samtools or any other sort tool; I wrote the function myself), I compare records in three steps:

  1. compare the reference index
  2. compare the start position
  3. compare the sum of the base qualities of at least 15, and mark duplicates
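
A minimal sketch of that three-step comparison, using a hypothetical Read stand-in rather than htsjdk's SAMRecord:

```java
import java.util.Comparator;

// `Read` is a stand-in for a decoded alignment record, not htsjdk's SAMRecord.
class Read {
    final int refIndex;   // step 1: reference (chromosome) index
    final int start;      // step 2: alignment start position
    final byte[] quals;   // per-base qualities

    Read(int refIndex, int start, byte[] quals) {
        this.refIndex = refIndex;
        this.start = start;
        this.quals = quals;
    }

    // Step 3: sum of base qualities of at least 15, the duplicate tie-breaker.
    int score() {
        int s = 0;
        for (byte q : quals) if (q >= 15) s += q;
        return s;
    }

    static final Comparator<Read> ORDER = Comparator
            .comparingInt((Read r) -> r.refIndex)
            .thenComparingInt(r -> r.start)
            // highest score first, so the read to keep precedes its duplicates
            .thenComparing(Comparator.comparingInt(Read::score).reversed());
}
```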

With these three steps I can sort and mark duplicates at the same time (see the sketch above). But I know this kind of duplicate marking is not thorough, so I would like to know how it will influence the final VCF results. Or do you have any better advice? Thanks again for your patience.

yfarjoun commented 7 years ago

The main issue is that MarkDuplicates doesn't use the regular "coordinate" sort order; rather, it sorts by the 5'-unclipped ends of both reads. I suspect that you will get more false positives, especially if PCR is part of the library prep.
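
For reference, the 5'-unclipped position is the alignment start pushed back past any leading soft/hard clips; htsjdk exposes this as SAMRecord.getUnclippedStart() and getUnclippedEnd(). A standalone sketch of the forward-strand case:

```java
class UnclippedStart {
    // Forward-strand case: alignment start minus the lengths of leading
    // soft (S) and hard (H) clips. Reverse-strand reads use the unclipped
    // *end* as their 5' position (not shown).
    static int unclippedStart(int alignmentStart, String cigar) {
        int clipped = 0, i = 0;
        while (i < cigar.length()) {
            int j = i;
            while (Character.isDigit(cigar.charAt(j))) j++;
            int len = Integer.parseInt(cigar.substring(i, j));
            char op = cigar.charAt(j);
            if (op != 'S' && op != 'H') break;  // only leading clips count
            clipped += len;
            i = j + 1;
        }
        return alignmentStart - clipped;
    }

    public static void main(String[] args) {
        // Two reads whose soft-clipping differs can still share a 5' end
        // and therefore still be duplicates of each other.
        System.out.println(unclippedStart(100, "5S95M"));  // prints 95
        System.out.println(unclippedStart(95, "100M"));    // prints 95
    }
}
```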

Good luck!

lishengting commented 11 months ago

@yfarjoun Is it possible to split the bam file by chromosome and run Picard on each part?

yfarjoun commented 11 months ago

It's a nice idea, but there be dragons...

If all your templates were always mapped within one chromosome, that could work. However, there always seem to be some reads that span two chromosomes, so once you split your reads by chromosome you may get inconsistent markings, and even an invalid bam file when you try to stitch the parts back together.

One thing that might be possible (need to think about it a little more...) would be to split the bam into:

  - one bucket per chromosome, for templates whose reads all map to that chromosome, and
  - one extra bucket for templates whose reads map to different chromosomes,

run MarkDuplicates on each bucket separately, and then merge-sort the results back.

The important thing is that all records from each template must be contained in the same "bucket" so that they are seen (by MarkDuplicates) together and marked consistently.
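
One consistent bucketing rule, shown here purely as an illustration (an assumption, not something Picard implements), is to key every record of a template on the template's lowest-indexed reference:

```java
class TemplateBuckets {
    // Illustrative rule only, not Picard behavior: every record of a
    // template goes to the bucket of the template's lowest-indexed
    // reference, so mates on different chromosomes still land in the
    // same bucket. -1 denotes an unmapped end.
    static int bucketFor(int refIndex, int mateRefIndex) {
        if (refIndex < 0) return mateRefIndex;
        if (mateRefIndex < 0) return refIndex;
        return Math.min(refIndex, mateRefIndex);
    }
}
```

Because duplicate templates share both ends' positions, two templates that duplicate each other always compute the same bucket, which preserves the invariant described above.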