fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

Positional dependence during GroupReadsByUmi #122

Closed: chapmanb closed this issue 8 years ago

chapmanb commented 8 years ago

We've been getting started with an AnnotateBamWithUmis, GroupReadsByUmi and CallMolecularConsensusReads pipeline to consolidate UMI-tagged input fastqs, and a couple of conceptual questions and feature requests came up about the grouping process:

Thanks much for putting together these tools.

mjafin commented 8 years ago

On point #1: when marking duplicates, samblaster doesn't just use the fragment alignment start/end coordinates but extends these by any soft-clipped bases. This increases the chances of catching the true duplicates in one group. I wonder if you take this into consideration?

tfenne commented 8 years ago

@chapmanb @mjafin: Thanks for the feedback and questions. The positions used in identifying source molecules do take into account soft-clipping just like Picard's MarkDuplicates and samblaster, i.e. the read ends are unclipped before taking the position.
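To illustrate what "unclipping" means here, a minimal Python sketch, not fgbio's actual code: it takes a 1-based alignment start and a CIGAR string and extends the aligned span by any leading and trailing clipped bases. The function name `unclipped_span` is hypothetical, and counting hard clips along with soft clips is an assumption of this sketch.

```python
import re

# One CIGAR element: a length followed by an operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def unclipped_span(start, cigar):
    """Return (unclipped_start, unclipped_end), 1-based inclusive.

    `start` is the aligned start position of the read; `cigar` is its
    CIGAR string. Hypothetical helper for illustration only.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

    # Clipped bases before the first aligned base.
    leading = 0
    for n, op in ops:
        if op not in "SH":
            break
        leading += n

    # Clipped bases after the last aligned base.
    trailing = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trailing += n

    # Operators that consume reference bases.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    aligned_end = start + ref_len - 1
    return start - leading, aligned_end + trailing

# Two reads from the same molecule, one of which had 5 bases soft-clipped:
print(unclipped_span(100, "50M"))    # aligned at 100
print(unclipped_span(105, "5S45M"))  # aligned at 105, same unclipped span
```

Both reads yield the same unclipped span even though their aligned start positions differ, which is why unclipping keeps true duplicates in one position group.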

It would be rather difficult given the current implementation to allow for wiggle room around the position though. The implementation sorts by the positions and then consumes groups of reads with the same position and uses the UMIs to assign them into groups. Changing that to allow +/- n bases would make the grouping-by-position non-deterministic as it would have to slide over positions and decide which position each read pair most likely came from.
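The sort-then-consume approach described above can be sketched roughly as follows. This is a simplified Python illustration, not the fgbio implementation: the tuple layout `(name, position_key, umi)` is hypothetical, and UMIs are matched exactly, which is an assumption made to keep the example short.

```python
from itertools import groupby

def group_reads(reads):
    """Assign a group id to each read.

    `reads` is an iterable of (name, position_key, umi) tuples, where
    position_key is any sortable value such as (chrom, start, end).
    Reads share a group only if both position_key and umi are identical.
    """
    assignments = {}
    next_id = 0
    # Sort so that reads with the same position are adjacent.
    ordered = sorted(reads, key=lambda r: r[1])
    # Consume one run of identical positions at a time.
    for _, same_position in groupby(ordered, key=lambda r: r[1]):
        ids_by_umi = {}
        for name, _, umi in same_position:
            if umi not in ids_by_umi:
                ids_by_umi[umi] = next_id
                next_id += 1
            assignments[name] = ids_by_umi[umi]
    return assignments

reads = [
    ("r1", ("chr1", 100, 249), "ACGT"),
    ("r2", ("chr1", 100, 249), "ACGT"),
    ("r3", ("chr1", 100, 249), "TTTT"),
    ("r4", ("chr1", 101, 249), "ACGT"),
]
print(group_reads(reads))
```

Here `r4` lands in its own group despite sharing a UMI with `r1` and `r2`, because its position differs by one base. With a +/- n tolerance there is no longer a single run of "same position" reads to consume: positions 100, 102 and 104 with n = 2 are pairwise close at each step, but 100 and 104 are not, so the outcome depends on where the window is anchored.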

On the other hand, for your transposon use case, it might be relatively easy to expose the code that calculates the positions for a read-pair as a function that could be overridden. I'm not sure I've fully understood, but if, e.g., you could return the start/end of the region instead of the read, that might work?
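A rough sketch of what such an overridable position function might look like, written in Python for brevity. None of these names exist in fgbio; `read_pair_key`, `make_region_key`, `group_by_key` and the dictionary layout of a read pair are all hypothetical.

```python
def read_pair_key(pair):
    """Default key: the pair's own (unclipped) coordinates."""
    return (pair["chrom"], pair["start"], pair["end"])

def make_region_key(regions):
    """Build a key function that returns the enclosing region instead.

    `regions` is a list of (chrom, start, end) tuples. Pairs that fall
    outside every region fall back to the default key.
    """
    def region_key(pair):
        for chrom, start, end in regions:
            if (chrom == pair["chrom"]
                    and start <= pair["start"]
                    and pair["end"] <= end):
                return (chrom, start, end)
        return read_pair_key(pair)
    return region_key

def group_by_key(pairs, key_fn=read_pair_key):
    """Group pair names by (position key, UMI), using exact UMI matches."""
    groups = {}
    for pair in pairs:
        groups.setdefault((key_fn(pair), pair["umi"]), []).append(pair["name"])
    return groups

pairs = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 250, "umi": "AAAA"},
    {"name": "b", "chrom": "chr1", "start": 103, "end": 252, "umi": "AAAA"},
]
# Default key: the two pairs differ in position, so they stay separate.
print(group_by_key(pairs))
# Region key: both pairs map to the same region and are grouped together.
print(group_by_key(pairs, make_region_key([("chr1", 90, 300)])))
```

Because the key is still an exact value, grouping stays deterministic; the position tolerance is moved into the choice of regions rather than into the grouping step.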

chapmanb commented 8 years ago

Tim -- thanks so much for all the details on soft clipping; it sounds like that's exactly what we need. Brilliant. Thanks also for explaining what it would take to support non-exact position dependence. This does sound like too much work/abstraction for an esoteric use case, so no worries at all. We can group in a separate step after tagging for this experiment instead of excessively generalizing your code here. Thanks again.