Use GroupReadsByUmi for standard duplicate marking?

eboyden commented 1 year ago

Few tools exist that will properly perform umi-based duplicate marking. Picard seems to be the only tool that can mark duplicates with error tolerance and properly sets the bitflag, but its algorithm isn't sophisticated; it just uses a hamming distance. Samtools recently added umi awareness to markdup, but currently it is not error-tolerant. Umi-tools can remove duplicates, but its markdup function doesn't work properly on paired end data, and doesn't set the bitflag. Whereas fgbio GroupReadsByUmi works properly on paired data using a sophisticated algorithm, but doesn't set the bitflag (I realize this is because it is intended to be run upstream of CallMolecularConsensusReads, not for standard duplicate marking).

One workaround is to run GroupReadsByUmi and then pipe the output to Samtools markdup, pointing it to the MI tag instead of the RX tag; the error tolerance has already been handled so Samtools doesn't need to do the heavy lifting there. This seems to work well and produces the expected output. But even when -m 0 is used, GroupReadsByUmi still filters out a lot of reads (unmapped or N-containing umis), whereas it would be ideal if the final marked bam were lossless.

Could functionality be added to GroupReadsByUmi to simply ignore reads that would otherwise be filtered, and then perform that filtering in the consensus calling tools themselves (or ignore them there as well)? If it would break something to not apply an MI tag, all reads destined for filtering could receive a unique or "N" MI tag. Selecting a "best" read in a family and setting the duplicate bitflag for the rest (thereby eliminating the need for a secondary tool) would be nice but I realize it may be out of scope.

Related: GroupReadsByUmi filters out reads with Ns in the umi. Could this be modified so that "N" always counts as a mismatch (including to other Ns), and reads within the specified edit distance are still collapsed? E.g. with -e 1 ACGN would still be collapsed into ACGT. (This is how Picard handles Ns in umis.)

nh13 commented 1 year ago

Sounds like a useful feature. Thanks for the suggestion!

JoeVieira commented 1 year ago

@nh13 , @eboyden & I are colleagues at Molecular Loop, I'd be happy to contribute a PR for this functionality if you're willing to accept it. This would help us out & we want to contribute back to the project.

nh13 commented 1 year ago

Caveat: we're always happy to review PRs, though I want to be fully transparent we are 100% client focused at this time, so there may be delays in reviewing. Feel free to reach out to us on our website if this becomes a higher priority.

For the "N" as a mismatch, that seems like an easy opt-in command line option, so feel free to make a PR for that separately from the mark duplicate work.

For the duplicate marking, we have to think carefully about how to select which read is "not a duplicate" (see htsjdk). Next, we also have to think if we want to add "optical duplicate" detection like Picard. Also, do we want to output any metrics specific to duplicate detection like Picard. Finally, we need to also track, flag, and output the secondary and supplementary reads (currently filtered out), along with re-considering if the other filters should be applied. I'd be fine not marking optical duplicates, having a single method to pick the "not duplicate" read. Thoughts?

JoeVieira commented 1 year ago

Totally understand & appreciate the transparency.

Agreed on both counts, the N mismatch is simple - I'll raise a PR for that, that's a small quality of life thing.

Agreed adding optical duplicates seems like a discrete feature in addition to this one, happy to leave that out.

For marking dups, the another established approach, is the default for umi-tools, the highest mapping quality as the "representative read". I see that htsjdk already has support for the scoring strategies Picard offers.

I'd propose extending that to include mapping quality & give those ScoringStrategy as options for an end user to choose between. ( also, understand this means a PR needs to be accepted to htsjdk also. )

For filtering, it strikes me that you've already built a good deal of flexibility into that, so I can't see a compelling case to adjust that, other than the N filter.

For supplementary & secondary reads - agreed it would be ideal to handle those, rather than force filtering them. Given the current behaviour ( filtering all of them when grouping ) it doesn't seem tightly coupled to this functionality and should we consider also treating that as a discrete feature addition?

nh13 commented 1 year ago

For supplementary & secondary reads - agreed it would be ideal to handle those, rather than force filtering them. Given the current behaviour ( filtering all of them when grouping ) it doesn't seem tightly coupled to this functionality and should we consider also treating that as a discrete feature addition

I'd be happy for this to be a separate PR or together with the duplicate marking, as the latter motivates the former, and we haven't needed the former up until now (since we always dup-mark).

JoeVieira commented 1 year ago

@nh13 I'll be adding another PR or two this week for this issue.

The first will be adding duplicate bitflag, it will consider & mark secondary & supplemental as well ( if they are in the Iterator ) It'll expose two new options -d and -D the first is just a flag to mark dups the section lets one choose the strategy ( currently just a pass thru for the strategies in htsjdk. Then will be adding a flags to more granularity control if secondary & supplementary reads are filtered out.

If there are any other thoughts you'd like to consider with this please let me know.

I'm also willing to take on any other work in the UMI / Duplicate marking regions of code which you have in your roadmap, to just help out too.

thanks!

nh13 commented 1 year ago

@tfenne

Poshi commented 8 months ago

I'm also interested in the ability to use this tool to correct the UMIs and get a BAM file with the corrected UMIs suitable for duplicate marking. In my particular case, I'm happy either running samtools markdup for that or using this very same tool for the duplicate marking. But I need to have a complete output, without any read discarded, as requested in the original issue message. I'd love to see a way to retain the reads that are currently filtered out (like the unmapped ones). Thanks @JoeVieira for your work on this issue :-)

fulcrumgenomics / fgbio

Use GroupReadsByUmi for standard duplicate marking? #906