BenLangmead / bowtie2


feature request: optional non-random selection among multiple best alignments #450

Open eboyden opened 8 months ago


It's my understanding that when a read has multiple "best" alignments, Bowtie 2 picks one pseudorandomly (in a deterministic fashion by default, unless overridden). This is often the preferred behavior, but there are situations where it is not. Consider a somatic oncology application in which a) the duplicate rate is nontrivial, b) UMIs are used to help generate consensus reads from duplicates, and c) paralogous regions are targeted. Within a UMI family of duplicate reads, some members could be assigned to paralog A and others to paralog B. Each group would then be condensed into its own "unique" consensus read based on position, even though all of the reads share the same UMI, which artificially inflates the number of unique reads. Some consensus read generators (e.g. fgbio) unmap the consensus reads, so they must be realigned, and there is no guarantee that the consensus reads will be evenly distributed among the paralogs after realignment, which makes the initial problem even worse. E.g.:

10 duplicate reads > map 6 to paralog A + 4 to paralog B > consensus read A + consensus read B > randomly map 0 to paralog A + 2 to paralog B
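
To make the inflation concrete, here is a toy sketch in Python. It is not Bowtie 2's actual tie-breaking logic or random number generator; `assign_randomly` and `count_consensus_groups` are hypothetical helpers that only model "each duplicate independently lands on one of two equally good paralogs" and "one consensus read per mapped locus within a UMI family".

```python
import random
from collections import Counter

def assign_randomly(n_reads, loci, rng):
    """Toy model: each duplicate read is placed on one of the tied loci at random."""
    return [rng.choice(loci) for _ in range(n_reads)]

def count_consensus_groups(assignments):
    """Within one UMI family, position-based grouping yields one consensus read per locus."""
    return len(set(assignments))

rng = random.Random(0)  # fixed seed only so the toy run is repeatable
family = assign_randomly(10, ["paralog_A", "paralog_B"], rng)

print(Counter(family))                 # how the 10 duplicates were split
print(count_consensus_groups(family))  # number of "unique" consensus reads produced
```

Under this model, a family of 10 duplicates stays on a single paralog only if all 10 random picks agree, so in almost every run one original molecule is reported as two consensus reads.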

A better option in this case would be to initially align all reads with a non-random choice among the "best" alignments (e.g. earliest position in the reference), so that all similar reads are grouped together prior to consensus calling. Subsequently, random "best" alignment could be performed on the consensus reads. E.g.:

10 duplicate reads > map 10 to paralog A + 0 to paralog B > consensus read A > randomly map to either paralog A or paralog B
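
The requested tie-break could look roughly like the Python sketch below. This describes the proposed behavior, not an existing Bowtie 2 option: the `Alignment` record, its fields, and `pick_best_deterministic` are all hypothetical, and "earliest position in the reference" is assumed to mean lowest reference index, then lowest coordinate.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    ref_index: int  # hypothetical: order of the sequence in the reference
    pos: int        # leftmost aligned coordinate
    score: int      # alignment score, higher is better

def pick_best_deterministic(candidates):
    """Among the top-scoring alignments, return the earliest one in the reference."""
    best_score = max(a.score for a in candidates)
    tied = [a for a in candidates if a.score == best_score]
    # Assumed tie-break: lowest reference index first, then lowest position.
    return min(tied, key=lambda a: (a.ref_index, a.pos))

candidates = [
    Alignment(ref_index=0, pos=5000, score=-6),   # paralog B, tied for best
    Alignment(ref_index=0, pos=1200, score=-6),   # paralog A, tied for best
    Alignment(ref_index=1, pos=300, score=-12),   # worse score, ignored
]
print(pick_best_deterministic(candidates))
```

Because the choice depends only on the candidate set, every duplicate with the same tied alignments lands on the same paralog, which is what keeps a UMI family together before consensus calling.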

In this case, even though the consensus read has a 50% chance of being assigned to the incorrect paralog, at least the problem isn't compounded by doubling the total coverage.