biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Optical duplicate calculation during markdup #325

Closed RichardCorbett closed 3 years ago

RichardCorbett commented 6 years ago

Hi all,

Using the HiSeqX sequencer, we occasionally observe a peak in duplicate rates that can be attributed to fragments being amplified across adjacent wells on the flowcell. To assess the rate at which this happens, we use the Picard MarkDuplicates command with the following extra parameters, which allow Picard to parse our read names:

...OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 READ_NAME_REGEX="[a-zA-Z0-9]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"

This allows us to get a report in the metrics file from which we can calculate the fraction of duplicates that are adjacent on the flowcell.
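To illustrate what the two parameters do, here is a minimal Python sketch (not Picard's or sambamba's actual code). It assumes the three capture groups are tile, x and y, in that order, and that two reads count as "proximal" when they share a tile and differ by at most the pixel distance on each axis. The function names and the example read names are hypothetical, and the sketch ignores read groups and lanes.

```python
import re

# Same pattern as in the command above; assumed groups: tile, x, y.
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)
PIXEL_DISTANCE = 2500  # same value as OPTICAL_DUPLICATE_PIXEL_DISTANCE above


def parse_location(read_name):
    """Return (tile, x, y) parsed from a read name, or None if it does not match."""
    match = READ_NAME_REGEX.fullmatch(read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


def is_proximal(name_a, name_b, max_distance=PIXEL_DISTANCE):
    """Assumed rule: same tile, and both |dx| and |dy| within max_distance."""
    loc_a = parse_location(name_a)
    loc_b = parse_location(name_b)
    if loc_a is None or loc_b is None:
        return False
    same_tile = loc_a[0] == loc_b[0]
    close_x = abs(loc_a[1] - loc_b[1]) <= max_distance
    close_y = abs(loc_a[2] - loc_b[2]) <= max_distance
    return same_tile and close_x and close_y


if __name__ == "__main__":
    # Hypothetical read names in the five-field layout the regex expects.
    print(parse_location("HX1:3:1101:10000:20000"))
    print(is_proximal("HX1:3:1101:10000:20000", "HX1:3:1101:11500:21000"))
```

A real implementation would only compare reads that have already been identified as duplicates of each other, so this check is one small piece of the calculation.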

Our current workflow is to mark duplicates with sambamba, but when we suspect a peak in "proximal" duplicates we have to return to Picard to get the estimate.
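For reference, the estimate we take from Picard could be computed roughly as follows. This is a sketch under assumptions: it expects the tab-separated table that follows the "## METRICS CLASS" line in the metrics file and reads the READ_PAIR_DUPLICATES and READ_PAIR_OPTICAL_DUPLICATES columns. The function name is hypothetical, and the example table is made up and shows only a subset of columns.

```python
def optical_duplicate_fraction(metrics_text):
    """Fraction of duplicate read pairs flagged as optical, summed over libraries."""
    lines = metrics_text.splitlines()

    # Locate the header row, assumed to follow the "## METRICS CLASS" line.
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[index + 1].split("\t")
            first_row = index + 2
            break
    else:
        raise ValueError("no '## METRICS CLASS' section found")

    total_duplicates = 0
    optical_duplicates = 0
    for row in lines[first_row:]:
        if not row.strip():
            break  # a blank line ends the metrics table
        fields = dict(zip(header, row.split("\t")))
        total_duplicates += int(fields["READ_PAIR_DUPLICATES"])
        optical_duplicates += int(fields["READ_PAIR_OPTICAL_DUPLICATES"])

    if total_duplicates == 0:
        return 0.0
    return optical_duplicates / total_duplicates


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example = (
        "## METRICS CLASS\texample\n"
        "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tREAD_PAIR_OPTICAL_DUPLICATES\n"
        "lib1\t1000\t200\t50\n"
        "\n"
        "## HISTOGRAM\tjava.lang.Double\n"
    )
    print(optical_duplicate_fraction(example))
```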

Thanks, Richard

pjotrp commented 3 years ago

No activity

pjotrp commented 3 years ago

I am closing this issue. It is interesting, and if someone wants to work on it, it can be reopened.