fix umi-tools dependency on python hash function values

CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets

MIT License

491 stars 190 forks source link

fix umi-tools dependency on python hash function values #470

Closed TyberiusPrime closed 1 month ago

TyberiusPrime commented 3 years ago

The umi-tools algorithm has an unfortunate dependency on the actual hash values python generates on strings With the anti-DOS mitigations from python 3.3, these change with every run.

This means, unless the user set's PYTHON_HASHSEED=0, in addition to --random-seed, umi-tools output is non deterministic.

That is surprising, and also, undocumented except in github issues.

This PR forces the set to use the equivalent of PYTHON_HASHSEED=0 in the one place that matters to the test cases.

I have not performed performance benchmarks - couldn't find such a facility in the repo.

There is one remaining test case failing 'group_adjacency', but it also fails with PYTHON_HASHSEED=0 on a clean checkout of either master or 1.1.1 - there is something else going on there.

IanSudbery commented 3 years ago

Hi TyberiousPrime,

Thanks for your contribution. I'm afraid I'm not comfortable with the addition of an additional dependency without some research first, especially one that is not on conda as conda is our preferred installation route. As we want a pure conda install, this might mean convincing someone to add siphashc to bioconda or conda-forge, or even maintaining a recipe ourselves. We've worked hard to try to keep the number of dependencies as low as possible.

@TomSmithCGAT Thoughts?

TyberiusPrime commented 3 years ago

Possibly, you can sort by the actual string in the return of breadth_first_search, fix up the test cases, and declare the new filtering canonical.

Or, to avoid a dependency, rip the siphashc code into the cython you have already. Its license does permit that. It's only 80 lines of very straight c.

TomSmithCGAT commented 3 years ago

Thanks for the PR @TyberiusPrime. It's always a relief when an issue turns into a PR rather than another item on the never ending to-do list!

I agree with @IanSudbery that we don't want to add a dependency outside conda. On that basis that the licence permits it, I would have no objection to copying the siphashc code over (with clear accreditation in the code) if we decide to go down that route.

I'm in two minds as to whether sorting so it's completely determinative is a good idea. It feels like it should remain non-determinative by default since there are cases where multiple reads are all equally 'good' representative reads for a UMI group. On the other hand, if there's no downside for making it determinative, I can't really see any reason not too? I can't imagine any cases where sorting would detriment downstream analysis, though maybe that's a lack of imagination on my part. @IanSudbery?

IanSudbery commented 3 years ago

In other places where two reads are equivalent, like which read to keep as representative of a read bundle, the decsision is explicitly random (and therefore can be determined by the random seed).

TomSmithCGAT commented 3 years ago

Good point. In that case, any objections to adding a sort to get around this PYTHONHASHSEED stuff?

I had another quick look into whether it can be set within the script. Appears I may have been mistaken before. Needs to be set before the interpreter starts (https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module) so it's either 1. add sort, 2. change hash function (e.g siphashc), or 3. update documentation and leave it as it is.

I favour option 1.

IanSudbery commented 3 years ago

I guess we'll need to do some performance benchmarks to see what the performance hit is, but I can believe its much.

TomSmithCGAT commented 1 month ago

Resolved by #550