CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI-Tools does not remove all duplicate reads #650

Closed Redmar-van-den-Berg closed 1 month ago

Redmar-van-den-Berg commented 3 months ago

Describe the bug UMI-Tools does not remove all duplicate reads. This can be verified by running UMI-Tools on its own output, which removes additional reads.

To Reproduce Take any bam file that has been deduplicated with UMI-Tools, and run UMI-Tools again on that output. In my test, running UMI-Tools a third time removed no additional reads.

umi_tools dedup \
                --method cluster \
                --paired \
                --stdin={input.bam} \
                --stdout={output.bam}

This issue occurs both with and without --paired mode.

Expected behavior I would expect UMI-Tools to remove all duplicate reads in a single pass.

Environment docker://quay.io/biocontainers/umi_tools:1.1.1--py38h0213d0e_1

Additional context This issue appears not to be related to duplicate reads in the output when using --paired as described in #347, since all occurrences of reads are completely removed from the bam file after running UMI-Tools a second time.

This issue does not appear to be related to any specific BAM flags in the reads; in my tests there wasn't a single flag or combination of flags that was not affected by this bug. I've attached some flag counts I made when testing running UMI-Tools multiple times:

once-twice.zip

IanSudbery commented 3 months ago

While I can't tell exactly without studying the BAM files in question closely, I don't think this necessarily indicates a problem. The most likely explanation is to do with the way UMI-tools decides whether two UMIs are the same or not.

The UMI-tools algorithm will merge two UMIs at the same mapping position if the edit distance between them is 1 (by default), and the count of the more common UMI is at least 2n-1, where n is the count of the less common UMI.

Consider the following situation:

    UMI     Count
    AAAAA   10
    AAAAT   10

When UMI-tools is run, these two UMIs would not be considered the same, as the counts are too similar (2*10-1 = 19, so the more common UMI would need to be seen at least 19 times). As UMI-tools deduplicates, each of these UMIs would appear a single time in the output:

    UMI     Count
    AAAAA   1
    AAAAT   1

If UMI-tools is run again, these UMIs will be merged, as 2*1-1=1 (in fact, the -1 factor is explicitly to ensure that chains of similar UMIs seen only once are merged).

Thus we would not expect UMI-tools to leave the input unchanged if the output of a previous run is used.
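To make the threshold concrete, here is a minimal Python sketch (an illustration of the 2n-1 rule described above, not the actual UMI-tools code) applied to the counts from the example:

    def hamming(a, b):
        """Edit (Hamming) distance between two equal-length UMIs."""
        return sum(x != y for x, y in zip(a, b))

    def can_absorb(count_more, count_less):
        """True if the more common UMI can absorb the less common one."""
        return count_more >= 2 * count_less - 1

    # First pass: both UMIs seen 10 times at the same position.
    print(hamming("AAAAA", "AAAAT"))   # 1, so they are candidates for merging
    print(can_absorb(10, 10))          # False: 10 < 2*10 - 1 = 19, both UMIs are kept

    # Second pass: deduplication left one read per UMI, so both counts are 1.
    print(can_absorb(1, 1))            # True: 1 >= 2*1 - 1 = 1, the UMIs are merged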

Redmar-van-den-Berg commented 3 months ago

This behavior also happens when you use the --method directional setting. I understand how this behavior follows from the way UMI-Tools is made, but I do not understand why this was chosen. In fact, when I originally read the paper I misread the directional method to mean that UMIs will be merged if their counts are lower than 2n-1.

On a conceptual level, shouldn't removing duplicates from a collection without duplicates remove nothing? Or put another way, if 10x AAAA is considered distinct from 10x AAAT, why is 1x AAAA considered identical to 1x AAAT?

This has real implications for how many duplicates UMI-Tools identifies: If I have sequenced with a high coverage, AAAA and AAAT will be seen as distinct reads, but if I have low coverage, AAAA and AAAT will be seen as identical.

TomSmithCGAT commented 3 months ago

'I understand how this behavior follows from the way UMI-Tools is made, but I do not understand why this was chosen.'

When I originally tested the 'directional' method, I started with thresholds which were scale invariant (e.g. 2n, 3n). However, I observed that there was still a higher than expected frequency of UMIs within a single edit distance, and this was due to the presence of chains of UMIs with very low counts. As @IanSudbery says, the -1 in the threshold is explicitly designed to collapse these chains.
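As an illustration (with hypothetical counts, not real data), the difference between a scale-invariant threshold and the 2n-1 threshold on a chain of low-count UMIs might look like this:

    # Hypothetical chain: each adjacent pair is within edit distance 1.
    chain = [("AAAAA", 5), ("AAAAT", 1), ("AAATT", 1)]

    def count_merges(counts, threshold):
        """How many adjacent pairs in the chain pass the given threshold."""
        merged = 0
        for (_, c1), (_, c2) in zip(counts, counts[1:]):
            more, less = max(c1, c2), min(c1, c2)
            if threshold(more, less):
                merged += 1
        return merged

    print(count_merges(chain, lambda m, n: m >= 2 * n))      # 1: the singleton pair survives
    print(count_merges(chain, lambda m, n: m >= 2 * n - 1))  # 2: the whole chain collapses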

In all honesty, I don't think we ever expected to stop at 'directional' and assumed we or someone else would implement a more Bayesian method for collapsing UMIs which would prove more accurate. It would be relatively trivial to test alternative approaches with simulated or real data should anyone have time and motivation to do so.

'On a conceptual level, shouldn't removing duplicates from a collection without duplicates remove nothing?'

I think this is a misconception about what you're doing when you pass a file through UMI-tools multiple times. UMI-tools is designed to handle a BAM file with all the alignments presented to it, and the algorithms have been shown to work well in that use case. If you remove duplicates, collapsing each set of duplicated reads down to a single read, and then pass the result to UMI-tools again, you're now working from an entirely different input, one in which the alignments have been simplified significantly. In that case, why should the algorithm yield the exact same output as before? It could (as is the case for most of the other methods available), but that shouldn't be expected, and not doing so doesn't demonstrate that the original deduplication was sub-optimal.

Redmar-van-den-Berg commented 3 months ago

Thank you for the explanation, it's good to know that UMI-Tools is not supposed to produce the same results when run on its own output.

I observed that there was still a higher than expected frequency of UMIs within a single edit distance and this was due to the presence of chains of UMIs with very low counts

Would you expect these chains to be caused by PCR duplicates, and hence be invalid? Is it not extremely unlikely that PCR amplification would produce such a chain of UMIs? In effect, that would mean that every cycle introduced a mismatch. Or is there another source of bias that would produce these chains?

TomSmithCGAT commented 3 months ago

The exact source of these chains is unclear to me. PCR errors are probably low enough frequency to discount that being a common cause, while sequencing errors shouldn't create chains by themselves. In combination though, a PCR error plus sequencing error could occasionally create a chain of length 2. Combined with the random sampling of the sequencing, this may occur more than intuitively expected. We took the error distances from random sampling (with frequencies as observed in the data to account for bias in UMI usage) as the best available ground truth and optimised the deduplication algorithms by comparison of the edit distances in the deduplicated BAM. While the algorithms were designed with the possible UMI error sources in mind, they are agnostic as to the source of the error.

Redmar-van-den-Berg commented 3 months ago

Right, if you take into account that UMI-Tools looks at UMIs for reads that are aligned at the same position, it is unlikely that these UMIs reflect truly unique reads. Unless the coverage is very high, the odds of finding two or more UMIs that differ in only 1 base by chance should be low.

IanSudbery commented 3 months ago

There is a two-dimensional process going on here - the higher the sequencing depth, the higher the chance of finding two UMIs that differ in only 1 base through some error process. The higher the original number of molecules, the higher the chance of finding two UMIs that differ in only one base that are genuinely unique.

Our testing suggests that you should try to keep the amount of "UMI space" used by your original (unamplified) molecule set under ~1/3 of the total available space. I.e., a 10mer UMI is fine as long as you expect fewer than ~300k molecules at any one location.
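As a quick back-of-envelope check of that rule of thumb (my arithmetic, not a figure from the UMI-tools docs):

    umi_length = 10
    umi_space = 4 ** umi_length     # 1,048,576 possible 10mer UMIs
    print(umi_space // 3)           # 349,525 -> the "~300k molecules" above, rounded down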

The chances of PCR mutations are not as low as you might expect. If you run a PCR for 15 cycles, then each original molecule in your tube will have been amplified to ~32,000 molecules, each with two strands synthesised by the polymerase = 64,000 strands. If each UMI is 10nt long, that's 640,000 bases synthesised by the polymerase. If the polymerase has an error rate of 1e-6 (e.g. Q5 polymerase), then the chance of an error is 1-(1-1e-6)^640,000, i.e. a ~47% chance of at least one error at some point in the process.
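Reproducing that arithmetic in Python (using the rounded figures from the comment above):

    strands = 32_000 * 2        # ~64,000 polymerase-synthesised strands per original molecule
    bases = strands * 10        # ~640,000 UMI bases synthesised (10nt UMI)
    error_rate = 1e-6           # e.g. Q5 polymerase
    p_any_error = 1 - (1 - error_rate) ** bases
    print(f"{p_any_error:.0%}") # ~47% chance of at least one error in the UMI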

A chain of 2 wouldn't require an error in every round - the two errors could arise in rounds quite distant from each other - it would just require that sampling meant both the earlier and later mutations were each sequenced only once. While this is unlikely for any given molecule, when you do tens or hundreds of millions of reads, the chance of it happening somewhere becomes much larger.

IanSudbery commented 1 month ago

I'm going to close this issue. Let me know if there is anything to add.