Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks source link

How to force reproducibility? #13

Closed lebolo closed 4 years ago

lebolo commented 4 years ago

Is there a way to force DnaOptimizationProblem to return the same optimized sequence for a given input sequence (particularly when enforcing GC content)? I saw in your tests that you try setting the Numpy random seed numpy.random.seed(123), but that only works when I run problems in a given Python session.

Here is my basic setup.

import random
import numpy
from dnachisel import CodonOptimize, DnaOptimizationProblem, EnforceGCContent, EnforceTranslation

SEED = 123

DNA_SEQ = 'ATGACATTCACCGAGAAGGTATTCAAACAAGTATGCGAATGGTGTATAGGCTTCTATCGTAAATATGCTGTTCCGAGTGAAGAGTGTGACTTCTTCCCTAGATGGATATGGGGGAACAAACGGAACATTCTGACCTTTACATTCAGTACCCCAATTACCTTACTGACGGCCCCGTCGCGACTTCCCGAAGAAGGGTGCCAAATAACCATCTGCGGACAACAGCAGGATTGCTTTGCAGTTGAGCGACACACACTACCGCAGAGCACGAAGATTAACAGATGGTGCGGCCGTTGTGAGCATCATAATCAAATCTATATTGAGCCTACCGAGCCAATTGGTGAACAACACGCGGACCATTTCGATTGCATTGAGAATAACGCTATAGATGACTATCCACCTTCAAATCTACCATGCTTACAAATTCAATCGGTTGAAAATGTCGGCCACCGGGGATGGTGTCAGGTACTGGTAAGATGGTGGGTGGATGCGTACAAAGACGGGGACCCCGTGGCTTGGGGTGACTTTAGGGGCCTGGTGAAATCTGCGGAATGGACCGTCCGCCCCCGGTCCAAGGCTGGAGATGAGACTCACAGAGAGTGGAAACATCCTTCTCTCGGGTGTCCTCACGGAAACATCATCGTTCTTGCTGCACCTGTTTCTTATCATCTCAACATTGATTGCACCGTCAACGTCCATAACCTTCAGAGCGCAGAATATACAGATCCTAGTTACGCTTTCACTCACGTATGTATTACAGCGTTCTTCGCAATACAGGCGATACCGCACGCCCGTATAAGCTTTTCAAAAGAGGCCAATCGGTACTATTATTGGATCATTACCTGTCATTGTTCCAGTTATTGCTGCTTACGTGAATGTTGGATATCGACTCGCAAAGATAAGTGGCATAGCCATAAGTACAAGGACAGCCCTCTATTTGTCGAATTTGGCTGCAAGGAAACTTGGTGCTTAGAACATTATGACAAGGAGGTCTCCCAGACCAAGCGCTACGGTTGCAAGCTCGTTGGAGATCGGTGCGGTTTCCATGAGGCCTTGTTTTTTGAGACCAAATGGAGAAACACGGAACCCCAACGGCTGTACGCATACTGGTCCATAAAAGTAGATTGCTGTAAGATCTATTCGGAAAACAACACAAAGAACGTGGATGAGATAGAATGGGCGACAAGAGAGTGGATTTGGCAAAAAAATATTAAGATACGGTCCAATGTAAGATTCCATTATCAGCCGGACCAAGAACATCCGGACGTCTGGCATGGTTTGTGCGAGCAACTAGTTACTATTGACAAAAATCACGAACTTCCGGACGGCCAGAAAATAATTTGTTGGGGTAAGGAGTGGCATGACGCGATTTACCAGGATATCGCTCCTTTTTACGCGTTAGTGCTGGAATATCGATGGTTTCAGTGGGGCAGATACGAGTGTTGGAAGACCTGGGGTCTCGCGGGCTGCACGGACCTTAATTGGGAGGACCCTTTCAGAGGGACGCACCCCATTCTACACAAGGGACAGGGGTATATTAAAGATGCGGCCTTGTTACGTTGGTTCCGCCACTTACGTTCCTCATTCTTTGCAGGAGTTAAGGCGAAAGACGAAGTCGGATATAATGGTTGCCAGTGGGATTACCGAGATTTACGAGAAGGCTTCGATTGTTGGCAATACACTTGCAGGAACTTTCTTAGTCCTAACCAGGTTAATGACTGTTGGTACAAAGGCGCATGTTGGAAACACACACCCGGATGGTTTGTAAAGGTTAAGTTTTTGTGGTGTACGAATGAAGCTTATTACTGGGGATTAAAGACCTGGCATAAAACACGCCAGATTCATAGTGACGGGACAAACCATGGGCTTACAGAATGCGTTATAACTTCATTGCGCGCAATTGACCAGCGGCCAAATAGCGACAAGGTAGAGAGTCCTGGACGGAAGTGTGAACGCGTGAGTCGAGTCCACTGCAGCTGGCGTAGCCAGGCCAAAGAGAAGGGAAGACAGGAAGCCCATTTCTACCTGGGGTGCGTGGTGGGCATTATAGGACATGACGTGCAAGAGCTTGAGATTGACTTAACGACTGATGACATCAGCAGTAATAAAAATTGCCACCACTGTGATCCTAATGCAAAGTGTCACTTTCACAAGGGGCAGGTAACGGATGAACACGCCCTATGTGATGAATGGGTTGTATTAAGATTTCCATATAATACCCAACGATGGAAAGCCATCCTCAGTTTGGAAGGCTTTTCTCAGCGAGAGTACGACAACTTCCCTCATTTACAAGCCAAACGGTGGCGTAAGGTGGCTCAGATCGATTACACTTTCGAGAATAACTTCTACTGGCATTGGTGGCAGCACCATGGTGGGAAGAAGGTGATCTGGGCGACACCAGCGATTACAGCCGGAATAGATTACCAAGACGCGAGATGCTCACCAACGGTAGCTCATCACAATAAGTTCTCATGTCACTATAATTGGTGGTTTCTTAAGTCCTCAGAGTATGACCAGCACTTCCCTGACACACAGGTAGTTGTTTTAGACTGGGATTGTGCGCCGAAGGGGCAGCTAGAACCGGTTACGTGTATCCAAACGTTTGACCGAATACACGCAGATGTATTCAGACAGTGGAACGCCCAGTGGTGGTGGGATTTCCGATTACAAACGAATTGGTATTGGCACTTCGACAACGATGAATGTCGGCCGTCGCTCCCATGGCGGGATTGTAATCTCCTTCAGTTTTATCTCCAGCTTGGTTGTAAGATCGCATTTCAACGTACCTGGCAGAAAACCCACCATCATCGCGTGTGCCCTGTGCCCGTCAGTTGGTTTCGCTGTATTGACAATCAAATCATTTTCCACTTCATACATGAATCTCAATGGACCCAGTCTCGTCAGAGAACTGTCTACGACCCCGTTCCTCACCGTCATTGGCACTGTTGGCCCTGGGCCATTTGGCGAAATGGACGCTACCATTACAAGTACAGCACACCTTGA'

def get_problem(dna_seq: str, gc_min: float = 0.40, gc_max: float = 0.60, bp_window: int = 60, species: str = 'e_coli'):
    # Force reproducibility
    random.seed(SEED)
    numpy.random.seed(SEED)  

    return DnaOptimizationProblem(
        sequence=dna_seq,
        constraints=[
            EnforceGCContent(mini=gc_min, maxi=gc_max, window=bp_window),
            EnforceTranslation()
        ],
        objectives=[
            CodonOptimize(species=species)
        ],
        logger=None
    )

# This passes!
def test_reproducibility():
    dna_seqs = []

    for i in range(10):
        problem = get_problem(DNA_SEQ)

        problem.resolve_constraints()
        problem.optimize()

        dna_seqs.append(problem.sequence)

    assert len(set(dna_seqs)) == 1

# This fails (optm_dna_seq is copy-pasted from a previous run)
def test_simple_optimization():
    optm_dna_seq = 'ATGACCTTCACCGAGAAGGTGTTTAAGCAGGTGTGCGAATGGTGCATTGGCTTTTATCGCAAATATGCGGTGCCGAGCGAAGAATGCGATTTCTTCCCGCGCTGGATTTGGGGCAACAAACGCAACATTCTGACCTTTACCTTTAGCACCCCAATTACCTTACTGACCGCGCCGAGCCGTTTACCGGAAGAAGGCTGCCAGATTACCATTTGCGGCCAACAGCAAGATTGCTTTGCGGTGGAACGCCATACCCTGCCGCAGAGCACCAAAATTAACCGTTGGTGCGGCCGTTGTGAACATCATAACCAGATTTATATTGAACCGACCGAACCGATTGGCGAACAGCATGCGGATCATTTTGATTGCATTGAGAACAACGCGATTGATGACTACCCGCCGAGCAACCTGCCGTGCTTACAGATTCAAAGCGTTGAAAATGTGGGCCATCGCGGCTGGTGCCAGGTTCTGGTTAGATGGTGGGTGGATGCATATAAAGATGGCGATCCGGTGGCGTGGGGTGATTTTCGCGGTCTGGTTAAATCTGCGGAATGGACTGTGCGCCCACGCAGCAAAGCGGGTGATGAAACCCATCGTGAATGGAAACATCCGAGCCTGGGCTGCCCGCATGGCAACATTATCGTGCTGGCGGCACCAGTTAGCTATCATCTGAACATTGATTGCACCGTGAACGTGCATAACCTGCAGAGCGCGGAATATACCGATCCGAGCTATGCGTTTACCCATGTGTGCATTACCGCGTTTTTTGCGATTCAGGCGATTCCGCATGCGCGCATTAGCTTTAGCAAAGAGGCGAACCGCTATTATTATTGGATCATTACCTGCCACTGCAGCAGCTATTGCTGCCTGCGCGAATGCTGGATTAGCACCCGCAAAGATAAATGGCATAGCCATAAATACAAGGACAGCCCGCTGTTTGTGGAATTTGGCTGCAAAGAAACCTGGTGCCTGGAACATTATGATAAAGAAGTGAGCCAGACCAAACGCTATGGCTGCAAACTGGTGGGCGATCGCTGCGGTTTTCATGAAGCGCTGTTTTTTGAAACCAAATGGCGCAACACCGAACCGCAGCGCCTGTATGCGTATTGGAGCATTAAAGTGGATTGCTGCAAAATCTACAGCGAGAACAACACCAAAAACGTGGATGAGATCGAATGGGCGACCCGCGAATGGATTTGGCAGAAAAACATTAAAATTCGCAGCAACGTGCGTTTTCATTATCAGCCGGACCAGGAACATCCGGATGTGTGGCATGGCCTGTGCGAGCAGCTGGTGACTATTGATAAAAACCATGAACTGCCGGATGGCCAGAAAATTATTTGCTGGGGCAAAGAATGGCATGATGCGATTTACCAGGATATCGCGCCGTTTTATGCGCTGGTGCTGGAATATCGCTGGTTTCAGTGGGGCAGATATGAATGCTGGAAAACCTGGGGCCTGGCGGGCTGTACTGATCTGAATTGGGAAGATCCGTTTCGTGGCACCCATCCGATTCTGCATAAAGGCCAGGGCTATATTAAAGATGCGGCGCTGCTGCGCTGGTTTCGCCATCTGCGTAGCTCTTTTTTTGCGGGTGTTAAAGCGAAAGATGAAGTGGGCTATAACGGCTGCCAGTGGGATTACCGCGATCTGCGCGAAGGCTTCGATTGCTGGCAGTATACCTGCCGCAACTTTCTGAGCCCGAACCAGGTGAACGATTGCTGGTATAAAGGCGCGTGTTGGAAACATACCCCGGGCTGGTTTGTGAAAGTGAAATTTCTGTGGTGCACCAACGAAGCGTATTATTGGGGCCTGAAAACCTGGCATAAAACCCGCCAGATTCATAGCGATGGCACCAACCATGGCCTGACCGAATGCGTGATTACCAGCTTGCGCGCAATTGATCAGCGCCCGAACAGCGATAAAGTGGAAAGCCCAGGTAGAAAATGCGAACGCGTGAGTCGTGTGCATTGCAGCTGGCGTAGCCAGGCGAAAGAAAAAGGCCGTCAGGAAGCGCATTTTTATCTGGGCTGCGTGGTGGGCATTATTGGCCATGACGTGCAGGAACTGGAAATTGACCTGACCACCGATGATATTAGCAGCAACAAAAACTGCCATCATTGCGATCCGAACGCGAAATGCCATTTTCATAAAGGCCAGGTGACCGATGAACATGCGCTGTGCGATGAATGGGTGGTGCTGCGCTTTCCGTATAACACCCAGCGCTGGAAAGCGATTCTGAGCCTGGAAGGCTTTAGCCAGCGCGAATATGATAACTTTCCGCATCTGCAGGCGAAACGCTGGCGCAAAGTGGCGCAGATTGATTATACCTTTGAAAACAACTTTTATTGGCATTGGTGGCAGCACCATGGTGGCAAGAAAGTGATTTGGGCGACCCCGGCGATTACTGCGGGTATTGATTATCAGGATGCGCGCTGCTCTCCAACTGTGGCGCATCACAACAAATTCAGCTGCCATTACAACTGGTGGTTTCTGAAAAGCAGCGAATATGATCAGCATTTCCCGGATACCCAGGTGGTGGTTTTAGATTGGGATTGCGCGCCGAAAGGCCAGCTGGAACCGGTGACCTGCATTCAGACCTTTGATCGCATTCATGCGGATGTGTTTCGCCAGTGGAACGCGCAGTGGTGGTGGGATTTTCGCCTGCAGACCAACTGGTATTGGCATTTTGATAACGATGAATGCCGCCCGAGCCTGCCGTGGCGCGATTGCAACCTGTTACAGTTTTATCTGCAGCTGGGTTGCAAAATTGCGTTTCAGCGTACCTGGCAGAAAACCCATCATCATCGCGTGTGCCCGGTGCCGGTTAGCTGGTTTCGCTGCATTGACAACCAGATTATTTTCCACTTCATTCACGAGAGCCAGTGGACCCAGTCTCGTCAGCGTACTGTGTATGATCCGGTTCCACATCGCCATTGGCATTGTTGGCCGTGGGCGATTTGGCGTAATGGCCGCTATCATTATAAATATAGCACCCCGTAA'

    problem = get_problem(DNA_SEQ)
    problem.resolve_constraints()
    problem.optimize()

    assert problem.sequence == optm_dna_seq

If you run test_reproducibility you get the same sequence for all iterations of the loop. This test passes.

However, I tried running test_simple_optimization once, recording the output and storing it into optm_dna_seq, then running the test again. This almost always fails for large sequences.

I can't tell where the randomization is happening. Is there a different random number generator that I should set the seed for?

Sorry if this is a bit convoluted, couldn't think how else to explain it.

Zulko commented 4 years ago

Hi @lebolo. This is actually a great question, I might have some elements of answer but not sure. What python version are you using?

lebolo commented 4 years ago

I'm using Python 3.7.3

BTW, thanks for the great tool!

Zulko commented 4 years ago

I'll look into this later, but I believe that the use of set(), in particular in the MutationSpace module, can introduce some randomness independently from numpy.seed().

For instance, the result of list(set(['a', 'b', 'c'])) will be consistent inside a same python session, but not between sessions. If this is the problem, then it should be fixable.

lebolo commented 4 years ago

Thanks @Zulko! For my test case above, I swapped out set for OrderedSet (https://pypi.org/project/ordered-set/) in MutationSpace and EnforceTranslation and it seems to be reproducible across Python sessions now.

Zulko commented 4 years ago

Hi @lebolo, I believe I also fixed it on my side by un-seting the variants in choose_random_variants of the MutationSpace module. I am in the process of generating samples for the tests and I'll update Github/PyPI very soon.

Zulko commented 4 years ago

I pushed a version on PyPI that hopefully enforces determinism based on the numpy.random.seed. Let me know how it works for you and please open new issues if you meet new reproducibility problems.

lebolo commented 4 years ago

Thanks Zulko, I updated and tried running mytest_simple_optimization above but it still fails to generate the same sequence across sessions.

One difference between your defined problem and mine is that I have EnforceGCContent(mini, maxi, window) as a constraint whereas you have EnforceGCContent(target) as an objective. If I move my GC content to an objective and change it to a target, it becomes reproducible.

Any thoughts on how to make it reproducible for my use case (where I wanted to constrain that it fits between a GC range)?

Zulko commented 4 years ago

On it.

Zulko commented 4 years ago

I think I got a fix this time. At least it works with your example, and in the tests I created around it. Could you try upgrading and trying again?

What happened is that my previous fixes only ensured determinism in the method for random search, not for exhaustive search (which is used by CodonOptimize). The method MutationSpace.all_variants still relied on a list(set(...)) pattern which is not deterministic across sections. This is now fixed.

lebolo commented 4 years ago

It's reproducible now for all of my tests. Thanks so much for the help!

Zulko commented 4 years ago

Awesome. Reproducibility was also a thorn in my foot (it's not great to debug a non-deterministic environment) so it had been on my todo list for some time. Closing this, but don't hesitate to report any other trouble!

pvgb commented 2 years ago

I am having some issues with reproducibility as well but my application is slightly different. I have 10 sequences each with different point mutations in the protein. I tried to set the SEED to see if I would get the same codons selected for most of the sequence but that doesnt seem to be the case. What would be your suggestion to deal with cases like that? Thanks

veghp commented 2 years ago

See this file on how the seed is set: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/master/tests/builtin_specifications/test_CodonOptimize.py

And this test specifically tests for determinism: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/master/tests/test_determinism/test_determinism.py

If these don't solve the problem, please post a reproducible example and I'll try it