Closed lebolo closed 4 years ago
Hi @lebolo. This is actually a great question; I might have some elements of an answer, but I'm not sure. What Python version are you using?
I'm using Python 3.7.3
BTW, thanks for the great tool!
I'll look into this later, but I believe that the use of `set()`, in particular in the `MutationSpace` module, can introduce some randomness independently of `numpy.seed()`. For instance, the result of `list(set(['a', 'b', 'c']))` will be consistent within a single Python session, but not between sessions. If this is the problem, then it should be fixable.
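To illustrate the point above, here is a small standalone demonstration (not DnaChisel code): Python salts string hashing per interpreter session via `PYTHONHASHSEED`, so the iteration order of a set of strings can change between sessions, while `sorted()` restores a stable order.

```python
import os
import subprocess
import sys

# Run list(set(...)) in fresh interpreter processes with different hash
# seeds, simulating separate Python sessions.
code = "print(list(set('abcdefgh')))"

orderings = {
    subprocess.run(
        [sys.executable, "-c", code],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True, text=True,
    ).stdout.strip()
    for seed in ("0", "1", "2", "3")
}
# Different hash seeds typically yield different set iteration orders,
# so `orderings` usually contains more than one distinct result.
print(orderings)

# sorted() imposes a session-independent order:
print(sorted(set("abcdefgh")))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```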
Thanks @Zulko! For my test case above, I swapped out `set` for `OrderedSet` (https://pypi.org/project/ordered-set/) in `MutationSpace` and `EnforceTranslation`, and it seems to be reproducible across Python sessions now.
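For readers who want to avoid the third-party `ordered-set` dependency, a minimal stdlib alternative (a sketch, not what DnaChisel uses): dicts preserve insertion order in Python 3.7+, so `dict.fromkeys` de-duplicates with a deterministic, insertion-ordered result.

```python
def ordered_unique(items):
    # dict.fromkeys keeps the first occurrence of each item and preserves
    # insertion order, unlike set(), whose iteration order is hash-dependent.
    return list(dict.fromkeys(items))

print(ordered_unique(["c", "a", "b", "a", "c"]))  # ['c', 'a', 'b']
```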
Hi @lebolo, I believe I also fixed it on my side by un-`set`ting the variants in `choose_random_variants` of the `MutationSpace` module. I am in the process of generating samples for the tests and I'll update GitHub/PyPI very soon.
I pushed a version to PyPI that hopefully enforces determinism based on `numpy.random.seed`. Let me know how it works for you, and please open new issues if you run into new reproducibility problems.
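The seeding pattern being tested here can be sketched with a toy stand-in optimizer (`toy_optimize` and `run_seeded` are hypothetical names, not DnaChisel's API): as long as all randomness flows through NumPy's global RNG, re-seeding before each run makes results repeatable.

```python
import numpy as np

def toy_optimize(seq):
    # Stand-in for a stochastic optimizer: draws from numpy's global RNG,
    # as DnaChisel's random search does.
    arr = list(seq)
    np.random.shuffle(arr)
    return "".join(arr)

def run_seeded(seq, seed=123):
    np.random.seed(seed)  # re-seed immediately before every run
    return toy_optimize(seq)

# Same seed => same result, within one session and across sessions,
# provided no other randomness source (e.g. set iteration) is involved.
print(run_seeded("ATGCATGCATGC") == run_seeded("ATGCATGCATGC"))  # True
```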
Thanks @Zulko, I updated and tried running my `test_simple_optimization` above, but it still fails to generate the same sequence across sessions.
One difference between your problem definition and mine is that I have `EnforceGCContent(mini, maxi, window)` as a constraint, whereas you have `EnforceGCContent(target)` as an objective. If I move my GC content specification to an objective and change it to a target, it becomes reproducible. Any thoughts on how to make it reproducible for my use case (where I want to constrain the GC content to fall within a range)?
On it.
I think I got a fix this time. At least it works with your example, and in the tests I created around it. Could you try upgrading and trying again?
What happened is that my previous fixes only ensured determinism in the random-search method, not in the exhaustive search (which is used by `CodonOptimize`). The method `MutationSpace.all_variants` still relied on a `list(set(...))` pattern, which is not deterministic across sessions. This is now fixed.
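The shape of such a fix (an assumed sketch, not the actual DnaChisel commit) is to replace order-dependent set iteration with a sorted pass, so exhaustive search enumerates variants identically in every session:

```python
def all_variants(variants):
    # De-duplicate, then impose a stable order. sorted(set(...)) is
    # deterministic across sessions; list(set(...)) is not, because set
    # iteration order depends on the per-session hash seed.
    return sorted(set(variants))

print(all_variants(["TTA", "CTG", "TTA", "CTA"]))  # ['CTA', 'CTG', 'TTA']
```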
It's reproducible now for all of my tests. Thanks so much for the help!
Awesome. Reproducibility was also a thorn in my side (it's no fun to debug a non-deterministic environment), so it had been on my todo list for some time. Closing this, but don't hesitate to report any other trouble!
I am having some issues with reproducibility as well, but my application is slightly different. I have 10 sequences, each with different point mutations in the protein. I tried setting the seed to see if I would get the same codons selected for most of the sequence, but that doesn't seem to be the case. What would be your suggestion for dealing with cases like that? Thanks
See this file on how the seed is set: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/master/tests/builtin_specifications/test_CodonOptimize.py
And this test specifically tests for determinism: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/master/tests/test_determinism/test_determinism.py
If these don't solve the problem, please post a reproducible example and I'll try it
Is there a way to force `DnaOptimizationProblem` to return the same optimized sequence for a given input sequence (particularly when enforcing GC content)? I saw in your tests that you set the NumPy random seed with `numpy.random.seed(123)`, but that only works when I run problems within a single Python session. Here is my basic setup.

If you run `test_reproducibility`, you get the same sequence for all iterations of the loop. This test passes. However, I tried running `test_simple_optimization` once, recording the output and storing it in `optm_dna_seq`, then running the test again. This almost always fails for large sequences. I can't tell where the randomization is happening. Is there a different random number generator that I should set the seed for?
Sorry if this is a bit convoluted, couldn't think how else to explain it.
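The cross-session check described above can be automated without manually recording output between runs (a sketch with hypothetical names; stdlib `random` stands in for the optimization problem): run the same seeded computation in two fresh interpreter processes and compare their outputs.

```python
import subprocess
import sys

# The seeded computation to test, executed in a brand-new interpreter
# each time, so no in-session state can mask non-determinism.
code = (
    "import random; random.seed(123); "
    "s = list('ATGCATGCATGC'); random.shuffle(s); print(''.join(s))"
)

def fresh_session():
    return subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    ).stdout.strip()

print(fresh_session() == fresh_session())  # True for a fully seeded RNG
```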