Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks

Don't use codons whose usage frequency is below a threshold #22

Closed Lix1993 closed 4 years ago

Lix1993 commented 4 years ago

For codons whose usage frequency is below a threshold (such as 0.1), set their usage to 0.

Since get_codons_table() is a staticmethod, use remove_codons_below_threshold() when codons_usage_threshold > 0.
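A minimal sketch of the thresholding idea described above, in plain Python. The function name is borrowed from this PR, but the body, the nested-dict table layout (amino acid, then codon, then frequency) and the frequencies are assumptions for illustration, not the code in the pull request or DnaChisel's internals.

```python
def remove_codons_below_threshold(codons_table, threshold):
    """Return a copy of the table where frequencies below `threshold` are 0.

    Hypothetical helper: the table layout {amino_acid: {codon: frequency}}
    is assumed for this sketch.
    """
    return {
        amino_acid: {
            codon: (frequency if frequency >= threshold else 0.0)
            for codon, frequency in codons.items()
        }
        for amino_acid, codons in codons_table.items()
    }


# Illustrative frequencies only, not real species data.
table = {"L": {"CTG": 0.6, "TTA": 0.35, "CTA": 0.05}}

# CTA falls below the 0.1 threshold, so its usage is set to 0.
filtered = remove_codons_below_threshold(table, threshold=0.1)
```

The sketch returns a new table instead of modifying the input, so the original frequencies stay available to any other code that uses them.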

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.4%) to 89.279% when pulling 7175056db01e03652119fa3649be6b1354d548c2 on Lix1993:master into 4e61ff13fed0938444ba6108ed4c0e29278ad2b8 on Edinburgh-Genome-Foundry:master.

Zulko commented 4 years ago

Thanks for the PR. I would accept it, but could you explain, or give a reference explaining, the need? Is it common to strictly avoid rare codons?

Lix1993 commented 4 years ago

Removing rare codons increases protein expression, based on our experimental results.

Zulko commented 4 years ago

I am not against the feature, but I have two objections to the implementation:

Therefore I would suggest, instead of adding a parameter to CodonOptimize, creating a new specification AvoidRareCodons(species="e_coli", min_frequency=0.20) that can be used in addition to CodonOptimize (or instead of CodonOptimize).

Would that make sense?
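To illustrate why a separate specification is workable, here is a rough standalone sketch of a rare-codon check that is independent of any codon optimization objective. The function name, the flat codon-to-frequency dict and the frequencies are all hypothetical; this is not how AvoidRareCodons is implemented in DnaChisel.

```python
def find_rare_codons(sequence, codon_frequencies, min_frequency):
    """List (start, end, codon) for every in-frame codon rarer than min_frequency.

    Hypothetical helper for illustration. Codons missing from the table
    are treated as having frequency 0, so they are reported as rare.
    """
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    rare = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        if codon_frequencies.get(codon, 0.0) < min_frequency:
            rare.append((start, start + 3, codon))
    return rare


# Illustrative frequencies only, not real species data.
frequencies = {"CTG": 0.6, "TTA": 0.35, "CTA": 0.05}

# Only the middle codon (CTA) is below the 0.20 cutoff.
violations = find_rare_codons("CTGCTATTA", frequencies, min_frequency=0.20)
```

Because the check only reports locations, it can be combined with any other objective, or used alone.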

Lix1993 commented 4 years ago

I think it may help.

In addition, when dealing with multi-objective optimization problems, boosts cannot represent problem weights, since the objectives' scores are not in the same range.
For example, codon_usage may get a score of -200 while avoid_hairpin gets a score below 10.
Do you have any suggestions for making the optimization objectives more 'equal'?

Zulko commented 4 years ago

Ok, I am working on the library right now so I'll add an AvoidRareCodons specification, which you'll be able to use on top of other optimization methods.

Regarding the objective scores and weights, it is true that different optimization objectives have typical scores in different ranges, as they are not always easy to compare with one another, and right now there is no other way than to play around with the boost parameter. I recognize this is an issue and I am open to suggestions. Right now, specification scores are designed so that, ideally, a nucleotide mutation should contribute between 0 and +1 to the overall score. However, that doesn't make every specification "comparable". Let me know if you have a particular example in mind where this could be a problem.
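The scale mismatch can be made concrete with a small sketch. It assumes the overall score is a boost-weighted sum of the individual objective scores, and it uses illustrative scores of -200 and -10 in the spirit of the example earlier in the thread. The helper names are hypothetical, and dividing by a reference magnitude is a generic normalization heuristic, not something DnaChisel does or this thread recommends.

```python
def weighted_total(scores, boosts):
    """Sum of boost * score over all objectives (assumed combination rule)."""
    return sum(boosts[name] * score for name, score in scores.items())


def equalizing_boosts(reference_scores):
    """Pick boost = 1 / |reference score| so every objective weighs about 1.

    Hypothetical heuristic; objectives with a zero reference keep boost 1.
    """
    return {
        name: (1.0 / abs(score) if score != 0 else 1.0)
        for name, score in reference_scores.items()
    }


# Illustrative scores only.
scores = {"codon_usage": -200.0, "avoid_hairpin": -10.0}

# With equal boosts, codon_usage dominates the total.
uniform_total = weighted_total(scores, {"codon_usage": 1.0, "avoid_hairpin": 1.0})

# With rescaled boosts, both objectives contribute about -1 each.
boosts = equalizing_boosts(scores)
balanced_total = weighted_total(scores, boosts)
```

The rescaled boosts only balance the objectives around the reference sequence that produced the scores; they would need to be recomputed if the typical score ranges shift.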

Zulko commented 4 years ago

Closing this in favor of the new AvoidRareCodons specification class.

Lix1993 commented 4 years ago

Thanks for your help.

Lix1993 commented 4 years ago

I am not dealing with multiple objectives yet, but that is my goal.

Our goal is to optimize a CDS sequence using different objectives with uniform weights.
We will then use experiments to determine which functions primarily affect protein expression, and then re-optimize the sequence with different weights.

I'm currently working on the evaluation functions for specific features, such as TAI, leading peptide...
When I get to the multi-objective problem, I'll paste an example here.

Sorry for my poor English.

Zulko commented 4 years ago

No worries, it is all clear and I really appreciate your suggestions. Let me know if you run into more problems or have ideas for improvement.