CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

makes umi_tools deterministic with --random-seed #550

Closed TomSmithCGAT closed 1 month ago

TomSmithCGAT commented 2 years ago

Trying to resolve issues with umi_tools being non-deterministic even with --random-seed. We have two open PRs on this.

I found #365 was sometimes reporting the same read multiple times. For example, running the test umi_tools group -L test.log --out-sam --random-seed=123456789 --method=adjacency --output-bam --group-out=group_adj_py3.tsv -I tests/chr19.bam > branch_out

SRR2057595.11597812_ATAAA is reported twice, but just once (as it should be) when using the master branch.

$ grep SRR2057595.11597812_ATAAA *out
branch_out:SRR2057595.11597812_ATAAA    16  chr19   4078297 255 38M *   0   0   *   *   XA:i:1  MD:Z:29A8   NM:i:1  UG:i:52 BX:Z:ATAAA
branch_out:SRR2057595.11597812_ATAAA    16  chr19   4078297 255 38M *   0   0   *   *   XA:i:1  MD:Z:29A8   NM:i:1  UG:i:53 BX:Z:ATAAA
master_out:SRR2057595.11597812_ATAAA    16  chr19   4078297 255 38M *   0   0   *   *   XA:i:1  MD:Z:29A8   NM:i:1  UG:i:52 BX:Z:ATAAA

470 works as intended but introduces a non-conda dependency, as discussed. We could rip the code or get siphashc into conda, but we can take a much simpler route if we don't care about having to update the test files.

Following @TyberiusPrime's suggestion in #470, this PR just adds a sort to the components in network.py

Runtime is unaffected, at least on the example.bam file we include in the release (9.4-9.7s with both master and this branch).

The test files will need to be updated of course. I just wanted to raise the PR first to check you are OK with this @IanSudbery before I update them and merge.

Note that the output when using the adjacency method can change with respect to not just which reads are output, but also how many reads, since the order of the UMIs with the same counts affects how many steps are required to account for a connected component. I see no issue with the number of reads returned in these cases being fixed by the sort, given we are currently treating any possible resolution as equally probable.

For the other methods, the number of reads returned will be unchanged.

IanSudbery commented 2 years ago

In general I have no issue with this, but I want to check performance on bigger input files (with 10s of thousands of UMIs at one position).

TomSmithCGAT commented 2 years ago

Have you got something in mind?

IanSudbery commented 2 years ago

I should have something somewhere from when I did benchmarking for the grant proposal.

TomSmithCGAT commented 2 years ago

Hi @IanSudbery - Did you have a chance to check the runtimes on bigger files?

TomSmithCGAT commented 1 year ago

Hi @IanSudbery. Would you be able to run the benchmarks that you wanted to before I merge this. The tests fail at the moment because many of the test files need to be updated since this fixed determination yields a different set of UMIs for cases where the selection is between equally likely UMIs, compared to the previous code.

IanSudbery commented 1 year ago

Okay, I checked out timeings. Not much in it. Here I checked times to process a single position with an increasing number of "real" UMIs, check with twenty random PCR errors:

Current:

Centre_umis     time
1       17.33
4       17.27
16      17.3
64      17.21
256     17.65
1024    21.36
4096    70.68
5792    117.89
8192    223.07999999999998
11585   534.42

Proposed:

Centre_umis     time
1       9.26
4       9.15
16      9.21
64      9.19
256     9.61
1024    13.61
4096    62.99
5792    111.89
8192    219.17000000000002
11585   592.01

I'm not sure why there is a small but constant reduction in time for the new method at lower UMI counts - possibly something to do with the other changes since this branch was proposed? Anyway, other than that, very little in it.

I also tested on a 5M read real data set, with 4M positions, and an average of 1.1 UMI per position. Both versions took exactly the same 165 seconds.

IanSudbery commented 3 months ago

I assume we are not worried that this will introduce a bias into which UMIs are selected. My guess is that it makes ties more likely to be solved in favour of alphabetically earlier UMIs? I'm not particulalry worried about this, but I thought I'd note that we are aware of it.

Its also worth noting that the siphash referred to in the first post on this PR is actaully now the default hash method in python. >=3.11

IanSudbery commented 1 month ago

🎉