bcgsc / NanoSim

Nanopore sequence read simulator
Other
233 stars 56 forks source link

Output is not reproducible #163

Open baraaorabi opened 2 years ago

baraaorabi commented 2 years ago

In line 1505 of the simulator.py script: https://github.com/bcgsc/NanoSim/blob/23911b67ce4733f0468ac26296e25348e3b73b4b/src/simulator.py#L1504-L1506

The seed is reset before each process/thread is started. This overrides the setting of the random seed in here: https://github.com/bcgsc/NanoSim/blob/23911b67ce4733f0468ac26296e25348e3b73b4b/src/simulator.py#L2180-L2181

It also causes the the output of Nanosim to be not reproducible despite using the same value for the --seed arg. I think these two lines (L1505 and L1506) should be dropped. Or maybe they should be replaced with something like?

deterministically_random_seed = random.getrandbits(32)
np.random.seed(deterministically_random_seed) 
random.seed(deterministically_random_seed)

This would give each thread a new random seed that is deterministically defined by the --seed value which should ensure reproducibility across runs with the same --seed.

kmnip commented 2 years ago

Thanks Baraa! I agree with you. I have suspicions that this is an issue when I was reading through simulator.py recently. @SaberHQ and I will look into it!

schorlton-bugseq commented 1 month ago

Hi @kmnip , is there a reason we would need to seed repetitively by the number of threads or can we remove these lines as suggested by @baraaorabi to enable reproducibility? Happy to MR if appropriate. Thanks for your consideration!