Edinburgh-Genome-Foundry / DnaChisel

A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License

Other optimization policies exist? #2

Closed y9c closed 4 years ago

y9c commented 6 years ago

https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/a318ca56f1731c7bffa6322b7648e0e5c237dd9f/dnachisel/builtin_specifications/CodonOptimize.py#L12-L14

Zulko commented 6 years ago

There seem to be several accepted ways to optimize codons. Here is an example: in the E. coli genome, the amino acid Glu is encoded by GAA 70% of the time and by GAG 30% of the time. When codon optimizing a gene for E. coli, you can either:

  1. Use GAA for Glu, all the time or as much as possible, as this seems to be the preferred codon.
  2. Use GAA 70% of the time and GAG 30% of the time, or as close to this as possible.

As of now, DnaChisel implements the strategy (1) only.
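
To make the difference concrete, here is a minimal sketch of the two strategies for the Glu example above. It is not DnaChisel code: the function names `best_codon` and `match_usage` are made up for illustration, and strategy (2) is approximated with simple largest-remainder rounding.

```python
from collections import Counter

# Glu codon usage from the example above (E. coli: GAA 70%, GAG 30%).
GLU_USAGE = {"GAA": 0.7, "GAG": 0.3}


def best_codon(usage, n):
    """Strategy 1: use the most frequent codon at all n positions."""
    top = max(usage, key=usage.get)
    return [top] * n


def match_usage(usage, n):
    """Strategy 2: split n positions between codons in proportion to usage."""
    # Largest-remainder rounding, so that the counts sum to exactly n.
    quotas = {codon: freq * n for codon, freq in usage.items()}
    counts = {codon: int(quota) for codon, quota in quotas.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(usage, key=lambda c: quotas[c] - counts[c], reverse=True)
    for codon in by_remainder[:leftover]:
        counts[codon] += 1
    return [codon for codon, k in counts.items() for _ in range(k)]


print(Counter(best_codon(GLU_USAGE, 10)))
print(Counter(match_usage(GLU_USAGE, 10)))
```

With ten Glu positions, strategy (1) yields ten GAA codons, while strategy (2) yields seven GAA and three GAG. The sketch only decides how many of each codon to use, not where each one goes in the sequence.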

Does this answer your question ?

y9c commented 6 years ago

@Zulko My question is whether there are options to use strategy (2) or any other strategy. If not, will they be added in a future release?

Zulko commented 6 years ago

They will certainly be added in the future. Do you need this feature now? If not, in what time frame?

y9c commented 6 years ago

@zulko I need this feature now. I don't think always using the best codon is the best solution.

Zulko commented 6 years ago

I believe this is now addressed with the last commit. The GitHub version (and soon the PyPI version) now has a parameter mode='best_codon' or mode='harmonized' for the specification CodonOptimize. For best_codon, the optimization will always replace a codon with the most frequent triplet possible. For harmonized, the optimization will bring the relative frequencies of the different triplets as close as possible to the frequencies in the reference species (this is also known as codon harmonization).
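
One way to read "as close as possible" is as a distance between the codon frequencies observed in a sequence and those of the reference species. The sketch below illustrates that idea only. It is an assumption about the kind of score involved, not DnaChisel's actual implementation, and `codon_frequencies` and `usage_mismatch` are hypothetical names.

```python
from collections import Counter


def codon_frequencies(sequence, codons):
    """Relative frequency of each of the given synonymous codons in `sequence`."""
    triplets = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    counts = Counter(t for t in triplets if t in codons)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in codons}


def usage_mismatch(sequence, reference):
    """Sum of absolute differences between observed and reference frequencies."""
    observed = codon_frequencies(sequence, reference)
    return sum(abs(observed[c] - reference[c]) for c in reference)


# Reference usage for Glu, taken from the E. coli example earlier in the thread.
GLU_REFERENCE = {"GAA": 0.7, "GAG": 0.3}

all_best = "GAA" * 10               # what a best-codon strategy would produce
harmonized = "GAA" * 7 + "GAG" * 3  # frequencies matching the reference

print(usage_mismatch(all_best, GLU_REFERENCE))
print(usage_mismatch(harmonized, GLU_REFERENCE))
```

Under this score, the all-GAA sequence is penalized for over-using GAA, while the 7:3 sequence matches the reference exactly. A real optimizer would compute such a score per amino acid and sum over all of them.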

Feel free to open new issues if this was not satisfactory.

picousse commented 5 years ago

I'm not an expert on this, but isn't harmonization about mapping the original host's codon usage table onto the destination host's codon usage table, so that if e.g. a codon is rare in the original host, it is also rare in the destination host?

If there is ribosome pausing in the original host, this should then also be the case in the destination host.

As far as I understood, the current harmonization algorithm randomly picks a codon, so that in the end the distribution over the different codons is as close as possible to the host codon table? Biology-wise this does not seem interesting. I think it is far more important to mimic the original translational process: ribosome pausing allows for folding of the part that is already translated, ...

But to conclude, this would be a nice-to-have, but currently it is not implemented, if I understood correctly?

picousse commented 5 years ago

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2364656/

Zulko commented 5 years ago

That makes sense. How would it work in practice (I may not have time to read through this)? Is it just replacing the most-used codon in the original host by the most-used codon in the target host, then the same for the second-most used codon, etc.?

picousse commented 5 years ago

Hi, sorry, but I'm not the expert here. Something like that, yes, but I'm afraid that if you don't take the percentages into account as well, you could get some weird results. E.g. let's say you have codon usages of 80%, 10%, and 10% in the original host and you map them purely based on ranking to something like 50%, 45%, and 5% in the destination host: you're pretty far off... I should dig deeper into these articles and algorithms to know exactly how they work.
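
The concern about rank-only mapping can be shown with a small sketch. The percentages are the ones from the example above; assigning them to the three Ile codons is purely an assumption for illustration, and `map_by_rank` and `map_by_closest_usage` are hypothetical helpers, not functions from DnaChisel or from the cited papers.

```python
def map_by_rank(source, target):
    """Map the k-th most used source codon to the k-th most used target codon."""
    src = sorted(source, key=source.get, reverse=True)
    tgt = sorted(target, key=target.get, reverse=True)
    return dict(zip(src, tgt))


def map_by_closest_usage(source, target):
    """Map each source codon to the target codon whose usage is numerically closest."""
    return {
        s: min(target, key=lambda t: abs(target[t] - source[s]))
        for s in source
    }


# Hypothetical usage of the three Ile codons in each host (80/10/10 vs 50/45/5).
source = {"ATT": 0.80, "ATC": 0.10, "ATA": 0.10}
target = {"ATC": 0.50, "ATT": 0.45, "ATA": 0.05}

for name, mapping in [("rank", map_by_rank(source, target)),
                      ("closest usage", map_by_closest_usage(source, target))]:
    print(name)
    for s, t in mapping.items():
        print(f"  {s} ({source[s]:.0%}) -> {t} ({target[t]:.0%})")
```

With rank mapping, the 10% codon ATC ends up replaced by a 45% codon, so a rare codon becomes a common one. Mapping by closest usage keeps both rare codons rare, at the cost of mapping two source codons to the same target codon.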

A nice database that might come in handy is https://www.kazusa.or.jp/codon/. Another possibility would be to feed in a GenBank file with all protein sequences of an organism (from e.g. NCBI).

Zulko commented 5 years ago

Yeah I get your point for the percentage, it seems to be what the papers are explaining. I am not sure when I will get to it, but it will certainly be a separate Specification called CodonHarmonize(original_host=, target_host=) where you can provide species names or codon tables. I am against supporting "whole genome" as an input, because it is very easy to build a table from a genome sequence anyways, then feed the table to CodonHarmonize.
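
As a rough illustration of how a table can be built from sequences, here is a minimal sketch that counts codons in a list of in-frame coding sequences and normalizes them per amino acid. It assumes the standard genetic code and that the coding sequences have already been extracted from the genome. `codon_usage_table` is a hypothetical helper, and its output format is not necessarily the one a CodonHarmonize specification would accept.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage_table(coding_sequences):
    """Return {amino_acid: {codon: relative frequency}} from in-frame CDSs."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in GENETIC_CODE:  # skips codons containing ambiguous bases
                counts[codon] += 1
    table = defaultdict(dict)
    for codon, aa in GENETIC_CODE.items():
        # Total count over all codons that encode the same amino acid.
        total = sum(counts[c] for c, a in GENETIC_CODE.items() if a == aa)
        table[aa][codon] = counts[codon] / total if total else 0.0
    return dict(table)


# Toy input: a single short CDS (Met-Glu-Glu-Glu-stop).
table = codon_usage_table(["ATGGAAGAAGAGTAA"])
print(table["E"])
```

For the toy input, Glu is encoded twice by GAA and once by GAG, so the table reports roughly 0.67 and 0.33 for those codons. A real table would be built from all annotated coding sequences of the organism.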

picousse commented 5 years ago

Couldn't agree more. Makes sense.

Thanks for the heads up. I might give it a shot as well, although I'm not a trained programmer.

ghost commented 5 years ago

A new article popped up (I still need to go through it completely) with design principles: https://www.nature.com/articles/nbt.4238

Zulko commented 5 years ago

Looks cool, would you mind summarizing the principles in this thread? (I will have a busy week.) One of the authors, Joao Guimaraes, wrote a sequence optimizer called D-tailor with cool ideas around 5 years ago. Maybe there will be software associated with this paper too?

ghost commented 5 years ago

Hi, indeed they are using D-tailor. I wasn't aware of that one. Still reading. A bit jealous of the article. Worth multiple PhDs... the amount of data alone...

jjs6w commented 5 years ago

I used the best codon optimization while avoiding some enzyme sites. Expression of the synthesized protein in human cell lines was good enough for me. It seems the article is saying that codon usage is not so important for expression and that it is just the beginning of the mRNA that seems to matter. It's all based on E. coli observations, so I'm not sure how applicable it is to other species.

Thanks for making and sharing this package Zulko. It has saved me a lot of time.

Zulko commented 5 years ago

Thanks for the comment @jjs6w, that's very good to know!

Zulko commented 4 years ago

Thanks everyone for the contributions, I am closing this as Chisel now has 3 different codon optimization classes ("use best codon", "match target codon usage", and "harmonize codons from host to target, using RCA"). See this section in the docs.