Closed zxl124 closed 8 months ago
If you set the environment variable PYTHONHASHSEED to a fixed value, the output should be consistent.
Since Python 3.3, the hashing used for e.g. dictionary keys is non-deterministic: string hashes are 'salted' with an unpredictable random value that changes between interpreter runs: https://docs.python.org/3.4/reference/datamodel.html#object.__hash__. I understand this is to prevent DoS attacks.
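To make the effect concrete, here is a minimal sketch (not UMI-tools code) that computes the hash of a string in a fresh interpreter, with and without a fixed PYTHONHASHSEED. The helper name `string_hash_in_child` is made up for illustration.

```python
import os
import subprocess
import sys


def string_hash_in_child(text, hash_seed=None):
    """Return hash(text) as computed by a freshly started Python interpreter.

    hash_seed is passed to the child via PYTHONHASHSEED; None leaves the
    default random salting switched on.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # do not inherit a seed from the caller
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Unseeded: the two values will almost always differ between runs.
    print(string_hash_in_child("UMI"), string_hash_in_child("UMI"))
    # Seeded: the two values are identical.
    print(string_hash_in_child("UMI", 0), string_hash_in_child("UMI", 0))
```

Anything whose iteration order depends on string hashes, such as a set of strings, can therefore come out in a different order on each run unless the seed is fixed.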
@IanSudbery - Should we add the above to the FAQ?
Yes. I guess there is no way to hardcode this?
Seems like it is possible: https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module. I think it would make sense from a user point of view if --random-seed set the value for PYTHONHASHSEED and made the output deterministic. Agree?
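One way this could work, sketched under assumptions: the hash seed is read once at interpreter start-up, so assigning to os.environ inside a running process does not change existing behaviour, and the script would have to re-execute itself with the variable set. The function names below are hypothetical and this is not necessarily the approach taken in either PR.

```python
import os
import sys


def env_with_hash_seed(random_seed, environ):
    """Return a copy of environ with PYTHONHASHSEED set to random_seed.

    Returns None when the variable already has the wanted value, which is
    what stops the re-execution below from looping forever.
    PYTHONHASHSEED must be an integer in the range 0 to 4294967295.
    """
    wanted = str(random_seed)
    if environ.get("PYTHONHASHSEED") == wanted:
        return None
    new_env = dict(environ)
    new_env["PYTHONHASHSEED"] = wanted
    return new_env


def ensure_hash_seed(random_seed):
    """Restart the current script once so that the hash seed takes effect."""
    new_env = env_with_hash_seed(random_seed, os.environ)
    if new_env is not None:
        # Replaces the current process; nothing after this line runs.
        os.execve(sys.executable, [sys.executable] + sys.argv, new_env)
```

A command-line tool would call `ensure_hash_seed(options.random_seed)` (a hypothetical option name) right after parsing its arguments, before doing any real work. Re-executing has a cost: start-up happens twice, and it assumes the script can be relaunched from `sys.argv`.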
Has this actually been fixed in a release? I'm seeing the same non-deterministic behaviour in dedup, even after setting random-seed. Can a note be added to the website to make it clear that random-seed on its own isn't sufficient? I'm trying exporting PYTHONHASHSEED now, but it has taken me a while of digging through these issues to find the fix.
Hi @SPPearce - Sorry for the time wasted digging into how to make UMI-tools deterministic.
We have two open PRs to deal with this (#365 & #470), and I have a separate idea I wanted to try as well. I'm optimistically hoping to decide which route to take this week and then issue a new version. I've been saying that for the past few weeks though 😬
See the outstanding #550
When I run dedup on the same BAM files twice, even with the same --random-seed, the returned deduped BAM files have different sets of reads. This has a very small but non-zero effect on downstream analysis. Would it be possible to have completely consistent results between runs when random seed is the same?
For context, the input BAM was coordinate-sorted, and generated using STAR. dedup was run with --random-seed 100 --spliced-is-unique --multimapping-detection-method=NH.