GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License

Support for DT tag to distinguish PCR and optical duplicates #38

Open seboyden opened 6 years ago

seboyden commented 6 years ago

I'd like to request a feature analogous to the Picard MarkDuplicates TAGGING_POLICY option, where setting All records the duplicate type (PCR or optical) in the optional DT tag, and OpticalOnly will only mark optical duplicates. It's often recommended to mark only optical duplicates in data from PCR-free library preps, which include most WGS. Thanks!

GregoryFaust commented 6 years ago

I agree that PCR-free WGS has become the norm, so I think this is a good suggestion. However, it does require that samblaster parse read IDs, something it does not currently do. I will strongly consider this feature for any upcoming major release of samblaster.
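For readers wondering what "parsing read IDs" would involve: Illumina read names from Casava 1.8 onward have seven colon-separated fields (instrument, run number, flowcell ID, lane, tile, x, y), and older names have five (instrument, lane, tile, x, y). The sketch below is not samblaster code; it only illustrates pulling the tile and x/y coordinates out of a SAM QNAME, assuming the last three fields are always tile, x and y. The names `FlowcellLocation` and `parse_location` are hypothetical, and names carrying suffixes such as `#0/1` are simply rejected here.

```python
from typing import NamedTuple, Optional


class FlowcellLocation(NamedTuple):
    """Where a read's cluster sat on the flow cell (hypothetical type)."""
    group: str  # everything before the tile: instrument/run/flowcell/lane
    tile: int
    x: int
    y: int


def parse_location(qname: str) -> Optional[FlowcellLocation]:
    """Return the flow cell location encoded in an Illumina-style QNAME.

    Assumes 5 or 7 colon-separated fields with tile, x, y last.
    Returns None for any name that does not fit that layout.
    """
    fields = qname.split(":")
    if len(fields) not in (5, 7):
        return None
    try:
        tile, x, y = (int(f) for f in fields[-3:])
    except ValueError:
        # Non-numeric tile/x/y, e.g. an old-style "1973#0/1" suffix.
        return None
    return FlowcellLocation(":".join(fields[:-3]), tile, x, y)
```

Keeping the leading fields together as `group` means two reads are only ever compared when they come from the same run, flowcell and lane.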

seboyden commented 6 years ago

Thanks—I (and others) will appreciate it!

seboyden commented 4 years ago

Any further consideration of adding optical duplicate marking?

GregoryFaust commented 4 years ago

Yes, I have been thinking about how to do this, but it is difficult in the one-pass algorithm that samblaster must use to satisfy its primary usage scenario in a pipe. In particular, I have yet to imagine a solution that does not approximately double the amount of memory used by samblaster in order to keep track of the Illumina flow cell location for reads.
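To make the memory concern concrete, here is a rough one-pass sketch, not samblaster's actual algorithm or data structures. It assumes each read arrives with a precomputed duplicate signature (however the tool derives it) plus its flow cell location, and it must remember the location of every read seen so far for each signature; that retained state is the extra memory discussed above. A duplicate is labelled `SQ` (optical/sequencing) when an earlier read with the same signature sits on the same tile within a pixel distance on both axes, otherwise `LB` (library/PCR); these are the DT values Picard writes. The function name `classify_stream` and the default distance of 100 (Picard's default `OPTICAL_DUPLICATE_PIXEL_DISTANCE`) are choices made for this illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Optional, Tuple

# (signature, group, tile, x, y); group = instrument/run/flowcell/lane prefix
Read = Tuple[Hashable, str, int, int, int]


def classify_stream(reads: Iterable[Read],
                    pixel_distance: int = 100) -> Iterator[Optional[str]]:
    """Yield None for the first read of each signature, else 'SQ' or 'LB'.

    Single pass over the input: locations of all earlier reads are kept
    per signature, which is the additional memory this feature costs.
    """
    seen = defaultdict(list)  # signature -> [(group, tile, x, y), ...]
    for signature, group, tile, x, y in reads:
        earlier = seen[signature]
        if not earlier:
            tag = None  # first occurrence, not a duplicate
        elif any(g == group and t == tile
                 and abs(x - ex) <= pixel_distance
                 and abs(y - ey) <= pixel_distance
                 for g, t, ex, ey in earlier):
            tag = "SQ"  # close neighbour on the same tile
        else:
            tag = "LB"  # duplicate, but not physically adjacent
        earlier.append((group, tile, x, y))
        yield tag
```

Because the stream is read once, the first read seen for a signature is always the one kept as the non-duplicate, whereas a sorted-input tool can choose the best read in a duplicate set after seeing all of them.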

seboyden commented 4 years ago

Thanks. I think 2X memory usage might be acceptable given that this would be optional, especially if the documentation/help warns about the increased memory.

carsonhh commented 3 years ago

I've submitted a pull request with changes I made that would allow this. You should be able to add UMI support on top of that in just a few minutes.