Representative sequences after deduplication not consistent between different runs

CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets

MIT License


Representative sequences after deduplication not consistent between different runs #440

Closed zxl124 closed 8 months ago

zxl124 commented 4 years ago

When I run dedup on the same BAM files twice, even with the same --random-seed, the returned deduped BAM files have different sets of reads. This has a very small but non-zero effect on downstream analysis. Would it be possible to have completely consistent results between runs when random seed is the same?

For context, the input BAM was coordinate-sorted, and generated using STAR. dedup was run with --random-seed 100 --spliced-is-unique --multimapping-detection-method=NH.

TomSmithCGAT commented 4 years ago

If you set the environment variable PYTHONHASHSEED (e.g. with export in bash) before running dedup, the output should be consistent.
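As a rough illustration of that suggestion, here is a minimal Python sketch that launches dedup with PYTHONHASHSEED fixed in the child process environment. The file names `input.bam` and `deduped.bam` and the helper names are hypothetical; the dedup options are the ones quoted in the original report, and the hash seed value of 0 is an arbitrary choice (any fixed value should behave the same way).

```python
import os
import subprocess


def dedup_command_and_env(in_bam, out_bam, random_seed=100, hash_seed=0):
    """Build the umi_tools command line and an environment with a fixed hash seed.

    in_bam and out_bam are hypothetical paths chosen for illustration.
    """
    cmd = [
        "umi_tools", "dedup",
        "-I", in_bam,
        "-S", out_bam,
        "--random-seed", str(random_seed),
        "--spliced-is-unique",
        "--multimapping-detection-method=NH",
    ]
    env = dict(os.environ)
    # Fixing PYTHONHASHSEED makes str/bytes hashing reproducible in the child.
    env["PYTHONHASHSEED"] = str(hash_seed)
    return cmd, env


def run_dedup(in_bam, out_bam):
    """Run dedup with the fixed hash seed (requires umi_tools on the PATH)."""
    cmd, env = dedup_command_and_env(in_bam, out_bam)
    subprocess.run(cmd, env=env, check=True)
```

Calling `run_dedup("input.bam", "deduped.bam")` twice should then give the same output, assuming hash randomization is the only source of variation between runs.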

Since Python 3.3, the hashing used in e.g. dictionary keys is non-deterministic by default: hashes of str and bytes objects are 'salted' with an unpredictable random value, so they change from one interpreter run to the next: https://docs.python.org/3.4/reference/datamodel.html#object.__hash__. I understand this is to prevent DoS attacks.
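A small sketch of the effect, using only the standard library: it starts fresh interpreters and reports the hash of a string and the iteration order of a set of strings. The UMI-like strings and the function name are made up for illustration. With PYTHONHASHSEED unset, the values will usually differ from run to run; with it fixed, they should be identical.

```python
import os
import subprocess
import sys

# Code executed in each fresh interpreter: print a string hash and a set's order.
CHILD_CODE = "print(hash('ACGT'), list({'ACGT', 'ACGA', 'TTGA', 'CCGT', 'GGAT'}))"


def hash_and_set_order(hash_seed=None):
    """Return the output of CHILD_CODE run in a new Python process.

    If hash_seed is None, PYTHONHASHSEED is removed so hashing is randomized.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("unseeded:", hash_and_set_order())
    print("unseeded:", hash_and_set_order())
    print("seed=100:", hash_and_set_order(100))
    print("seed=100:", hash_and_set_order(100))
```

Any code whose result depends on the iteration order of a set of strings would inherit this run-to-run variation, independently of the seed passed to the random module.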

@IanSudbery - Should we add the above to the FAQ?

IanSudbery commented 4 years ago

Yes. I guess there is no way to hardcode this?

TomSmithCGAT commented 4 years ago

Seems like it is possible: https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module. I think it would make sense from a user point of view if --random-seed set the value for PYTHONHASHSEED and made the output deterministic. Agree?
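For reference, one way this could work is sketched below. This is not the UMI-tools implementation, only an assumption about how a script might do it: because the hash seed is read at interpreter start-up, the script re-executes itself with PYTHONHASHSEED set in the environment. The function names are hypothetical.

```python
import os
import sys


def env_with_hash_seed(seed, environ):
    """Return a copy of environ with PYTHONHASHSEED set to seed.

    Returns None when the variable already has that value, meaning no
    restart is needed.
    """
    wanted = str(seed)
    if environ.get("PYTHONHASHSEED") == wanted:
        return None
    new_env = dict(environ)
    new_env["PYTHONHASHSEED"] = wanted
    return new_env


def reexec_with_hash_seed(seed):
    """Restart the current script with a fixed hash seed, if not already set.

    Setting os.environ alone has no effect on the running interpreter, so
    the process is replaced. Interpreter options such as -m are not
    preserved by this simple version.
    """
    new_env = env_with_hash_seed(seed, os.environ)
    if new_env is not None:
        os.execve(sys.executable, [sys.executable] + sys.argv, new_env)
```

A script would call `reexec_with_hash_seed(100)` as early as possible, e.g. right after parsing a random seed option. After the restart the variable already matches, so the call returns without doing anything.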

SPPearce commented 2 years ago

Has this actually been fixed in a release? I'm seeing the same non-deterministic behaviour in dedup, even after setting --random-seed. Can a note be added to the website to make it clear that --random-seed on its own isn't sufficient? I'm trying exporting PYTHONHASHSEED now, but it took me a while digging through these issues to find the fix.

TomSmithCGAT commented 2 years ago

Hi @SPPearce - Sorry for the time wasted digging into how to make UMI-tools deterministic.

We have two open PRs to deal with this (#365 & #470), and I have a separate idea I wanted to try as well. I'm optimistically hoping to decide which route to take this week and then issue a new version. I've been saying that for the past few weeks though 😬

TomSmithCGAT commented 8 months ago

See the outstanding #550