Positional dependence during GroupReadsByUmi

chapmanb commented 8 years ago

We've been getting started with an AnnotateBamWithUmis, GroupReadsByUmi and CallMolecularConsensusReads pipeline to consolidate UMI-tagged input fastqs, and a couple of conceptual questions and feature requests came up about the grouping process:

1. How are read positions treated for the grouping process? The UMI edit distance and grouping algorithms make sense, but we were not sure of the conditions under which reads are considered identically mapped for grouping. Does this take soft-clipping into account? Is there currently wiggle room in coordinate matching, or are exact mappings of both reads required?
2. As a feature request, would it be possible to make the coordinate matching flexible to nearby positions, where the definition of nearby is configurable? We have applications using UMIs for tagging transposons where read mappings will be within a region, and it would be useful to group all reads within a window with the same UMI.

Thanks much for putting together these tools.

mjafin commented 8 years ago

On point #1: when marking duplicates, samblaster doesn't just use the fragment alignment start/end coordinates but extends them by any soft-clipped bases. This increases the chances of catching the true duplicates in one group. I wonder if you take this into consideration?

tfenne commented 8 years ago

@chapmanb @mjafin: Thanks for the feedback and questions. The positions used in identifying source molecules do take into account soft-clipping just like Picard's MarkDuplicates and samblaster, i.e. the read ends are unclipped before taking the position.
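
To make "unclipped" concrete, here is a minimal Python sketch of how an unclipped start/end can be derived from an alignment start and a CIGAR string. It is an illustration only, not fgbio's code: the function name `unclipped_span` is hypothetical, and counting hard clips (`H`) along with soft clips (`S`) is an assumption of this sketch.

```python
import re

# One CIGAR element: a length followed by a SAM operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def unclipped_span(pos, cigar):
    """Return (unclipped_start, unclipped_end), 1-based inclusive.

    `pos` is the 1-based alignment start, `cigar` the CIGAR string.
    Hypothetical helper for illustration; S and H are both treated as clips.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

    # Sum the clipped bases at the start of the read.
    leading = 0
    for n, op in ops:
        if op not in "SH":
            break
        leading += n

    # Sum the clipped bases at the end of the read.
    trailing = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trailing += n

    # Operators that consume reference bases determine the aligned end.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    aligned_end = pos + ref_len - 1
    return pos - leading, aligned_end + trailing
```

With this, a read aligned at 100 with `50M` and a read aligned at 105 with `5S45M` get the same unclipped span, which is why unclipping helps place true duplicates in the same group.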

It would be rather difficult given the current implementation to allow for wiggle room around the position though. The implementation sorts by the positions and then consumes groups of reads with the same position and uses the UMIs to assign them into groups. Changing that to allow +/- n bases would make the grouping-by-position non-deterministic as it would have to slide over positions and decide which position each read pair most likely came from.
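
The sort-then-consume approach described above can be sketched roughly as follows. This is a simplified Python illustration, not the actual implementation: the read dictionaries and the `group_reads` name are hypothetical, and UMIs are matched exactly here rather than by edit distance.

```python
from itertools import groupby

def group_reads(reads):
    """Assign a molecule id to each read.

    `reads` is an iterable of dicts with keys 'name', 'pos_key' and 'umi',
    where 'pos_key' is a sortable tuple such as (chrom, start, end).
    Assumption of this sketch: UMIs must match exactly within a position.
    """
    assignments = {}
    next_id = 0

    # Step 1: sort so that reads with identical positions are adjacent.
    ordered = sorted(reads, key=lambda r: r["pos_key"])

    # Step 2: consume one run of identically positioned reads at a time.
    for _, same_pos in groupby(ordered, key=lambda r: r["pos_key"]):
        umi_to_id = {}
        # Step 3: within a position, the UMI decides the molecule.
        for read in same_pos:
            if read["umi"] not in umi_to_id:
                umi_to_id[read["umi"]] = next_id
                next_id += 1
            assignments[read["name"]] = umi_to_id[read["umi"]]
    return assignments
```

Because each run is keyed on an exact position, a read that is one base off starts a new run even when its UMI matches. Allowing +/- n bases would mean a read could fall into more than one candidate run.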

On the other hand, for your transposon use case, it might be relatively easy to expose the code that calculates the positions for a read-pair as a function that could be overridden. I'm not sure I've fully understood, but if, e.g., you could return the start/end of the region instead of the read, that might work?
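
As a rough sketch of that idea in Python (hypothetical names throughout, and not an existing fgbio API), the position calculation becomes a pluggable function, and a region-based variant returns the enclosing region's coordinates instead of the read's own:

```python
from collections import defaultdict

def exact_key(read):
    """Default key: the read's own coordinates."""
    return (read["chrom"], read["start"], read["end"])

def make_region_key(regions):
    """Build a key function that maps a read to its enclosing region.

    `regions` is a list of (chrom, start, end) tuples, 1-based inclusive,
    assumed non-overlapping. Reads outside every region keep exact_key.
    """
    def key(read):
        for chrom, start, end in regions:
            inside = start <= read["start"] and read["end"] <= end
            if chrom == read["chrom"] and inside:
                return (chrom, start, end)
        return exact_key(read)
    return key

def group_by(reads, position_key=exact_key):
    """Group read names by (position key, UMI); UMIs are matched exactly."""
    groups = defaultdict(list)
    for read in reads:
        groups[(position_key(read), read["umi"])].append(read["name"])
    return dict(groups)
```

Passing `make_region_key([("chr1", 1000, 2000)])` as `position_key` would place all reads inside that region with the same UMI into one group, while the default keeps exact-position behaviour.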

chapmanb commented 8 years ago

Tim -- thanks so much for all the details on soft clipping, it sounds like that's exactly what we need. Brilliant. Thanks also for the details on changing out for non-exact position dependence. This does sound like too much work/abstraction for an esoteric use case, so no worries at all. We can group in a separate step after tagging for this experiment instead of excessively generalizing your code here. Thanks again.

fulcrumgenomics / fgbio

Positional dependence during GroupReadsByUmi #122