GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License

Add support for read groups and PCR-free optical-duplicate-only filtering #53

Open carsonhh opened 3 years ago

carsonhh commented 3 years ago

I've added an --opticalOnly option that marks only optical duplicates in PCR-free data. It uses a shortcut: reads on the same tile are considered duplicates, rather than measuring the distance between reads on the same tile. I also added an --optPlusExAmp flag that marks reads in the same lane as duplicates (this should capture both optical and ExAmp duplicates, which can occupy positions anywhere within the same lane).

The read group and tile/lane support use khash tables to keep track of a 20-bit iterator that identifies unique RG/tile number/lane number combinations. You should be able to add UMI support by adding a single method that pulls values out of the SAM attributes, similar to how the RG value is extracted here.

Additional memory use is limited to the size of the new khash that tracks the RG/tile number/lane number combinations. The effect on runtime is negligible.
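For readers who want a concrete picture of the bookkeeping described above, here is a minimal sketch, not the code in this PR. It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`, uses `std::unordered_map` in place of khash, and requires C++17. The names `samTagValue`, `splitColon`, and `GroupIds` are hypothetical and do not come from samblaster.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: return the value of a string-typed (":Z:") SAM tag,
// e.g. tag "RG" for a field "RG:Z:sample1". Simplification: it searches the
// whole line for "\tTAG:Z:" instead of walking the optional fields one by one.
std::optional<std::string> samTagValue(const std::string& line,
                                       const std::string& tag) {
    const std::string needle = "\t" + tag + ":Z:";
    const std::size_t pos = line.find(needle);
    if (pos == std::string::npos) return std::nullopt;
    const std::size_t start = pos + needle.size();
    const std::size_t end = line.find('\t', start);
    return line.substr(start,
                       end == std::string::npos ? std::string::npos : end - start);
}

// Hypothetical helper: split a read name (QNAME) on ':'.
std::vector<std::string> splitColon(const std::string& s) {
    std::vector<std::string> out;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find(':', start);
        if (pos == std::string::npos) {
            out.push_back(s.substr(start));
            break;
        }
        out.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return out;
}

// Maps each distinct RG/lane[/tile] combination to a small integer ID.
// IDs are capped at 2^20 so they would fit in a 20-bit field.
class GroupIds {
public:
    static constexpr std::uint32_t kMaxIds = 1u << 20;

    // laneOnly = true groups by RG + lane (the lane-level mode);
    // laneOnly = false groups by RG + lane + tile (the tile-level mode).
    std::optional<std::uint32_t> idFor(const std::string& rg,
                                       const std::string& qname,
                                       bool laneOnly) {
        const std::vector<std::string> f = splitColon(qname);
        if (f.size() < 7) return std::nullopt;  // not an Illumina-style name

        std::string key = rg + '\t' + f[3];     // field 4 is the lane
        if (!laneOnly) key += '\t' + f[4];      // field 5 is the tile

        const auto it = ids_.find(key);
        if (it != ids_.end()) return it->second;
        if (ids_.size() >= kMaxIds) return std::nullopt;  // 20-bit space full

        const std::uint32_t id = static_cast<std::uint32_t>(ids_.size());
        ids_.emplace(key, id);
        return id;
    }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
};
```

In this sketch, two reads that already match on alignment position would be treated as optical duplicates only if `idFor` returns the same ID for both. UMI support could follow the same pattern by calling `samTagValue(line, "RX")` (the UMI tag in the SAM optional-fields specification) and appending the result to the key.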