In general I have no issue with this, but I want to check performance on bigger input files (with 10s of thousands of UMIs at one position).
Have you got something in mind?
I should have something somewhere from when I did benchmarking for the grant proposal.
Hi @IanSudbery - Did you have a chance to check the runtimes on bigger files?
Hi @IanSudbery. Would you be able to run the benchmarks you wanted before I merge this? The tests fail at the moment because many of the test files need to be updated: the deterministic selection yields a different set of UMIs from the previous code in cases where the choice is between equally likely UMIs.
Okay, I checked out the timings. Not much in it. Here I checked the time to process a single position with an increasing number of "real" UMIs, with twenty random PCR errors added in each case (a rough sketch of the setup is given after the tables):
Current:
Centre_umis   time
1             17.33
4             17.27
16            17.30
64            17.21
256           17.65
1024          21.36
4096          70.68
5792          117.89
8192          223.08
11585         534.42
Proposed:
Centre_umis   time
1             9.26
4             9.15
16            9.21
64            9.19
256           9.61
1024          13.61
4096          62.99
5792          111.89
8192          219.17
11585         592.01
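For reference, something along the lines of the sketch below is how the numbers were generated. This is a hedged reconstruction rather than the exact script I ran: the helper functions are made up, and it assumes the documented UMIClusterer interface.

```python
# Hedged sketch (not the exact benchmark script; helper names are made up):
# time clustering of a single position with n_real "real" UMIs plus a handful
# of low-count PCR errors, via the documented UMIClusterer API.
import random
import time
from collections import Counter

from umi_tools import UMIClusterer


def random_umi(length=10):
    return bytes(random.choice(b"ACGT") for _ in range(length))


def add_pcr_error(umi):
    # Single random base substitution to mimic a PCR error.
    pos = random.randrange(len(umi))
    new_base = random.choice([b for b in b"ACGT" if b != umi[pos]])
    return umi[:pos] + bytes([new_base]) + umi[pos + 1:]


def time_position(n_real, n_errors=20, reps=10):
    # "Real" UMIs get high counts; error UMIs get a count of 1.
    counts = Counter({random_umi(): random.randint(10, 100) for _ in range(n_real)})
    for parent in random.choices(list(counts), k=n_errors):
        counts[add_pcr_error(parent)] += 1
    clusterer = UMIClusterer(cluster_method="directional")
    start = time.time()
    for _ in range(reps):
        clusterer(dict(counts), 1)  # threshold of 1 mismatch
    return time.time() - start


for n in (1, 4, 16, 64, 256, 1024):
    print(n, round(time_position(n), 2))
```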
I'm not sure why there is a small but consistent reduction in time for the new method at lower UMI counts - possibly something to do with the other changes made since this branch was proposed? Anyway, other than that, there is very little in it.
I also tested on a real 5M-read data set, with 4M positions and an average of 1.1 UMIs per position. Both versions took exactly the same time: 165 seconds.
I assume we are not worried that this will introduce a bias into which UMIs are selected? My guess is that it makes ties more likely to be resolved in favour of alphabetically earlier UMIs. I'm not particularly worried about this, but I thought I'd note it so that we are aware of it.
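To make that guess concrete, here is a toy illustration (not the actual network.py logic): max() returns the first of several equally good items it sees, so once candidates are pre-sorted, ties on count fall to the alphabetically earlier UMI.

```python
# Toy illustration only (not the actual network.py logic): after a lexicographic
# sort, count ties are resolved in favour of the alphabetically earlier UMI,
# because max() returns the first maximal element it encounters.
counts = {b"TTTT": 5, b"AAAA": 5, b"AAAT": 2}

# Without a sort, the winner depends on whatever iteration order we are handed.
unsorted_pick = max(counts, key=counts.get)

# With a sort, the tie between b"TTTT" and b"AAAA" always goes to b"AAAA".
sorted_pick = max(sorted(counts), key=counts.get)
print(unsorted_pick, sorted_pick)
```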
It's also worth noting that the siphash referred to in the first post on this PR is actually now the default hash method in Python >= 3.11 🎉
Trying to resolve issues with umi_tools being non-deterministic even with --random-seed. We have two open PRs on this.
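For background, the flavour of non-determinism involved is illustrated by the umi_tools-independent sketch below: iteration order over a set of UMI strings depends on Python's per-process hash randomisation, which no library-level random seed can pin down.

```python
# Minimal illustration (independent of umi_tools): set iteration order for
# strings depends on per-process hash randomisation, so it can differ between
# runs regardless of any library-level random seed. Pinning PYTHONHASHSEED
# (or sorting, as this PR does) makes it reproducible.
import os
import subprocess
import sys

snippet = "print(list({'ACGT', 'TTTT', 'AAGG', 'CCCC'}))"

env = dict(os.environ, PYTHONHASHSEED="random")   # randomised hashing (the default)
for _ in range(3):
    # Each fresh interpreter may print the elements in a different order.
    subprocess.run([sys.executable, "-c", snippet], env=env, check=True)

env["PYTHONHASHSEED"] = "0"                       # fixed hash seed: identical output
for _ in range(3):
    subprocess.run([sys.executable, "-c", snippet], env=env, check=True)
```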
I found #365 was sometimes reporting the same read multiple times. For example, running the test
umi_tools group -L test.log --out-sam --random-seed=123456789 --method=adjacency --output-bam --group-out=group_adj_py3.tsv -I tests/chr19.bam > branch_out
SRR2057595.11597812_ATAAA
is reported twice, but just once (as it should be) when using the master branch.

#470 works as intended but introduces a non-conda dependency, as discussed. We could pull the code into umi_tools ourselves or get siphashc into conda, but we can take a much simpler route if we don't mind having to update the test files.
Following @TyberiusPrime's suggestion in #470, this PR just adds a sort to the components in network.py.
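In spirit it is no more than the kind of change sketched below. This is illustrative only and the names will not match network.py exactly; the point is simply to impose a sorted order wherever iteration would otherwise follow set/dict hash order, so that tie-breaking between equally supported UMIs becomes reproducible.

```python
# Illustrative only (names won't match network.py exactly): traverse and return
# connected components in sorted order so that downstream tie-breaking no longer
# depends on hash-based set/dict ordering.
def get_connected_components_sorted(umis, adj_list):
    """Return connected components with a deterministic, sorted traversal order."""
    found, components = set(), []
    for node in sorted(umis):                      # sorted, not set/hash order
        if node in found:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adj_list[current])
        found.update(component)
        components.append(sorted(component))       # sort the members too
    return components
```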
Runtime is unaffected, at least on the example.bam file we include in the release (9.4-9.7s with both master and this branch).
The test files will need to be updated of course. I just wanted to raise the PR first to check you are OK with this @IanSudbery before I update them and merge.
Note that the output when using the adjacency method can change with respect to not just which reads are output, but also how many, since the order in which UMIs with the same counts are considered affects how many UMIs are needed to account for a connected component. I see no issue with the number of reads returned in these cases being fixed by the sort, given that we currently treat every possible resolution as equally probable.
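As a toy illustration of that point (hedged; not taken from the real code or test data): with tied counts, whether a component's "hub" UMI is considered first or last changes how many UMIs are needed to account for the component, and hence how many reads get reported.

```python
# Toy example (not from the test data): with equal counts, the order in which
# tied UMIs are considered changes how many are needed to cover the component,
# and hence how many reads the adjacency method reports.
adj_list = {
    b"AAAA": [b"AAAT"],
    b"AAAT": [b"AAAA", b"AATT"],   # the "hub": adjacent to both other UMIs
    b"AATT": [b"AAAT"],
}


def greedy_cover(order):
    covered, selected = set(), []
    for umi in order:
        if umi not in covered:
            selected.append(umi)
            covered.add(umi)
            covered.update(adj_list[umi])
    return selected


print(len(greedy_cover([b"AAAT", b"AAAA", b"AATT"])))  # 1: hub first covers everything
print(len(greedy_cover([b"AAAA", b"AAAT", b"AATT"])))  # 2: end first, another pick needed
```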
For the other methods, the number of reads returned will be unchanged.