Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License

Suggestion: add design method like d-tailor #31

Closed: Lix1993 closed this 4 years ago

Lix1993 commented 4 years ago

It would be useful for designing sequences for experiments. Besides, I think it could solve #27.

Lix1993 commented 4 years ago

As described in the D-Tailor tutorial:

In D-Tailor, a class defining a design objective extends the abstract class Design, and there are already four predefined methods:

• Optimization: only one specific combination of property scores is desired. For example, to increase the expression of a given gene, we may want to design a sequence with high CAI, strong binding between SD and the 16S rRNA and weak mRNA secondary structure around the initiation region.
• FullFactorial: all possible combinations between the levels of the different properties are generated. This methodology is appropriate to systematically vary the multiple properties and quantify their effect on the observed phenotype.
• CustomDesign: this is a more flexible design where the user can indicate each combination of property scores that he/she wants to design for.
• RandomSampling: this method does not enforce any particular combination of properties a priori. It can be used to generate a predetermined number of new sequence variants and observe how they scatter across the property space.
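
To make the FullFactorial idea concrete, here is a minimal sketch in plain Python. It does not use the D-Tailor or DNA Chisel APIs; the property names and levels are made up for illustration, and the function only enumerates the combinations that a design would then have to fill with actual sequences.

```python
from itertools import product

# Hypothetical properties and levels, for illustration only.
property_levels = {
    "cai": ["low", "medium", "high"],
    "sd_binding": ["weak", "strong"],
    "mrna_structure": ["weak", "strong"],
}


def full_factorial(levels):
    """Yield one dict per combination of property levels."""
    names = list(levels)
    for combo in product(*(levels[name] for name in names)):
        yield dict(zip(names, combo))


designs = list(full_factorial(property_levels))
print(len(designs))  # 3 * 2 * 2 = 12 combinations
print(designs[0])
```

A CustomDesign would correspond to hand-picking a subset of these combinations instead of generating all of them.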

Zulko commented 4 years ago

Thanks for the suggestion, here are my thoughts on this so far:

DNA Chisel and D-tailor solve different problems: Chisel is about quickly converging to an optimal solution, even under complex specifications, and Tailor is about creating sets of sequences with different fitness with respect to objectives. As a consequence, the frameworks work differently. For your problem, I can see several approaches (here ranked from least work on DNA Chisel to most work):

  1. Define specifications with a "target score" which you can tune, as I discuss in #27. For instance, instead of MaximizeCAI, you would have an objective TuneCAI(target=some_score). With this you could obtain a sequence with the fitness you want. However, you would need to do one full optimization for every sequence you want to obtain. So if your goal is to generate hundreds of sequences, this could be much slower than Tailor.
  2. To just create (unguided) variability between sequences you can generate sequences iteratively and make sure that each sequence is very different from any previously generated sequence. Like in this example, where a collection of different primers is generated from the same specifications. This is admittedly a naive solution, but it could work in your case.
  3. Find a way to convert the DNA Chisel specifications you need into D-tailor specifications, and use D-Tailor directly. This would be logical, since Tailor specializes in this kind of problem. I am not sure how well it would work.
  4. Port the method used in Tailor to DNA Chisel. This could be by adding a new method find_multiobjective_variants to DnaOptimizationProblem, or (simpler in a first step) writing an extension of DNA Chisel (e.g. a Design class like you suggest, and maybe a Solver class) which implements the new search methods. You could still take advantage of Chisel's methods for constraining and generating mutations, checking hard constraints, or suggesting suboptimal regions, but the rest of the algorithm would be different.
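
As a rough illustration of the "target score" idea in approach 1, the sketch below tunes GC content toward a chosen target with a naive hill climb. It is plain Python and does not use DNA Chisel: `target_score` and `tune_gc` are hypothetical names, GC content stands in for any tunable property such as CAI, and the search is far simpler than a real solver. It also shows the cost mentioned above: each target needs its own full optimization run.

```python
import random


def gc_content(sequence):
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def target_score(sequence, target):
    """Score is 0 at the target and negative elsewhere (higher is better)."""
    return -abs(gc_content(sequence) - target)


def tune_gc(sequence, target, tolerance=0.01, max_iterations=10000, seed=0):
    """Hill-climb with single-base mutations until GC content is near target."""
    rng = random.Random(seed)
    best = sequence
    best_score = target_score(best, target)
    for _ in range(max_iterations):
        if -best_score <= tolerance:
            break
        position = rng.randrange(len(best))
        candidate = best[:position] + rng.choice("ACGT") + best[position + 1:]
        score = target_score(candidate, target)
        # Accept mutations that do not move the score away from the target.
        if score >= best_score:
            best, best_score = candidate, score
    return best


start = "AT" * 50  # 100 bases, GC content 0.0
variants = {target: tune_gc(start, target) for target in (0.3, 0.5, 0.7)}
for target, variant in variants.items():
    print(target, round(gc_content(variant), 2))
```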

There is also the problem of defining which score is "best", "good", "average", or "bad". In D-tailor, this is done by taking a big dataset of real-life sequences (e.g. all genes in E. coli) and looking at the distribution of the scores in the sequences. This would be doable with DNA Chisel too. However, in DNA Chisel many scores depend on sequence length (for instance, longer genes would most probably get worse scores as they would have more suboptimal regions). This is on purpose, to guide global and local optimizations, but it makes the specifications less suited to looking at score distributions in a set of sequences, as Tailor does. You would need to redefine these DNA Chisel specifications a bit so they would be sequence-length-independent.
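
A minimal sketch of what "sequence-length-independent" could mean in practice, again in plain Python rather than with the DNA Chisel API. The toy score counts occurrences of an unwanted pattern (so it grows with length), the normalized version reports occurrences per 1000 bases, and the levels are quantile cut points over a reference set. All function names and the four toy reference sequences are assumptions for illustration; a real reference set would be a large collection of real-life sequences, as described above.

```python
from statistics import quantiles


def count_suboptimal_regions(sequence, pattern="AAAA"):
    """Toy length-dependent score: occurrences of an unwanted pattern."""
    return sum(
        sequence[i:i + len(pattern)] == pattern
        for i in range(len(sequence) - len(pattern) + 1)
    )


def normalized_score(sequence, pattern="AAAA"):
    """Occurrences per 1000 bases, comparable across sequence lengths."""
    return 1000 * count_suboptimal_regions(sequence, pattern) / len(sequence)


def level_boundaries(reference_sequences, n_levels=4):
    """Quantile cut points of the normalized score over a reference set."""
    scores = [normalized_score(seq) for seq in reference_sequences]
    return quantiles(scores, n=n_levels)


def level_of(sequence, boundaries):
    """Index of the level (0 = lowest scores) the sequence falls into."""
    score = normalized_score(sequence)
    return sum(score > boundary for boundary in boundaries)


short = "AAAAGC" * 10
long = short * 10
# Raw counts differ by a factor of 10, normalized scores do not.
print(count_suboptimal_regions(short), count_suboptimal_regions(long))
print(normalized_score(short), normalized_score(long))

# Toy reference set, for illustration only.
reference = [short, "ACGTAC" * 10, "AAAACG" * 5 + "ACGTAC" * 5, "AAAAAAGC" * 8]
boundaries = level_boundaries(reference)
print([level_of(seq, boundaries) for seq in reference])
```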

I'm tagging @jcg (who developed D-tailor) on this issue for awareness and possible suggestions.

Lix1993 commented 4 years ago

Since D-Tailor is written in Python 2 and has not been updated since 2013, adding an objectiveDesignMixin may be a better way to solve this. I'll work on this.

Lix1993 commented 4 years ago

I wrote a prototype for this at https://github.com/Lix1993/DnaChisel/tree/design. It should be reorganized, but it works for now.

I'm looking for a job now, so I may improve it in the future.

Zulko commented 4 years ago

That sounds great. It looks like you are following the D-Tailor naming and methods; let us know how it works in real life! As there are many files in the module, it could become a library of its own, so users could get D-Tailor features from DnaChisel-compatible constraints and objectives.

Did I understand correctly that you are looking for a job? Is it in the computational biology area?

Lix1993 commented 4 years ago

Yes, in the bioinformatics or computational biology area.

Lix1993 commented 4 years ago

Here is my short resume.

Do you have any suggestions?