Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks source link

CodonOptimize mode 'harmonized' only optimizes first 1/3 of the sequence #10

Closed ValentijnBroeken closed 5 years ago

ValentijnBroeken commented 5 years ago

Using the CodonOptimize 'harmonized' functionality, I do not get a codon distribution similar to the one I specified. It seems that the second 2/3s of a sequence are never optimized.

A minimal example:

from dnachisel import *

problem = DnaOptimizationProblem(
            sequence='GACGACGACAAAAAAAAAAAAAAAAAA',
            constraints=[EnforceTranslation()],
            objectives=[CodonOptimize(species='b_subtilis', mode='harmonized')]
           )

problem.resolve_constraints()
problem.optimize()

print('SEQUENCE:', problem.sequence)

frequencies, positions = biotools.biotools.codons_frequencies_and_positions(problem.sequence)
print(*frequencies.items(), sep='\n')

This sequence encodes DDDKKKKKK and gives as output:

SEQUENCE: GACGATGATAAAAAAAAAAAAAAAAAA
('V', {'total': 0, 'GTA': 0.0, 'GTT': 0.0, 'GTC': 0.0, 'GTG': 0.0})
('L', {'CTT': 0.0, 'TTA': 0.0, 'CTC': 0.0, 'CTG': 0.0, 'CTA': 0.0, 'total': 0, 'TTG': 0.0})
('Q', {'CAG': 0.0, 'total': 0, 'CAA': 0.0})
('G', {'total': 0, 'GGC': 0.0, 'GGA': 0.0, 'GGT': 0.0, 'GGG': 0.0})
('P', {'CCA': 0.0, 'total': 0, 'CCT': 0.0, 'CCG': 0.0, 'CCC': 0.0})
('A', {'total': 0, 'GCG': 0.0, 'GCC': 0.0, 'GCT': 0.0, 'GCA': 0.0})
('T', {'total': 0, 'ACT': 0.0, 'ACA': 0.0, 'ACG': 0.0, 'ACC': 0.0})
('C', {'total': 0, 'TGT': 0.0, 'TGC': 0.0})
('S', {'AGC': 0.0, 'AGT': 0.0, 'TCT': 0.0, 'total': 0, 'TCG': 0.0, 'TCC': 0.0, 'TCA': 0.0})
('N', {'AAC': 0.0, 'total': 0, 'AAT': 0.0})
('H', {'CAC': 0.0, 'total': 0, 'CAT': 0.0})
('I', {'total': 0, 'ATA': 0.0, 'ATT': 0.0, 'ATC': 0.0})
('D', {'GAT': 0.6666666666666666, 'total': 3, 'GAC': 0.3333333333333333})
('M', {'total': 0, 'ATG': 0.0})
('F', {'TTT': 0.0, 'total': 0, 'TTC': 0.0})
('R', {'CGG': 0.0, 'AGG': 0.0, 'CGC': 0.0, 'CGA': 0.0, 'total': 0, 'AGA': 0.0, 'CGT': 0.0})
('K', {'total': 6, 'AAG': 0.0, 'AAA': 1.0})
('E', {'total': 0, 'GAG': 0.0, 'GAA': 0.0})
('*', {'total': 0, 'TAA': 0.0, 'TGA': 0.0, 'TAG': 0.0})
('Y', {'total': 0, 'TAC': 0.0, 'TAT': 0.0})
('W', {'total': 0, 'TGG': 0.0})

It does well for Aspartic Acid (D), as this has a GAT 0.64 / GAC 0.36 ratio, but for K with a AAA 0.7 / AAG 0.3 ratio, it does nothing at all. This is position dependent, as a sequence encoding KKKDDDDDD does the opposite.

While trying to debug, I printed all variants that were used during exhaustive search for this sequence (line 454 of DnaOptimizationProblem.py) , which are:

GACGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGATGATAAAAAAAAAAAAAAAAAA

This again points to the fact that the later positions are never included in the variants to be analysed.

I'm using the latest version, dnachisel 1.1.

It would be great if you could find out where the problem is and come up with a quick fix, so I can use your library for my thesis. Thank you!

Zulko commented 5 years ago

Thanks for the nice report, I am convinced that something is weird. I'll try to sort it our asap (there is a chance that it is due to the special way codon harmonization is defined and optimized in dnachisel. It is unclear to me what to do with codon harmonization, see #2 in case you have a comment on the issue).

Zulko commented 5 years ago

That's an actual bug and I found it (indices that referred to codon positions were wrongly used as nucleotide positions, so for instance the 8th codon was considered at position 8 in the sequence, instead of ~24). That's why only the first third was considered needing optimization.

I'll do a few more fixes and add tests and push that.

ValentijnBroeken commented 5 years ago

Thank you very much for your quick response! Great you directly managed to find the bug.

Regarding #2: The method doesn't appear to me as codon harmonization, which for each amino acid replaces the most used codon from the original organism by the most used codon of the receiving organism, and the second by the second, etc. That thus depends on the codon usage table from the donor organism as well, and would not (necessarily) map to the exact codon distribution of the species specified. I'd say the current 'harmonized' mode is more a mode mimicking the distribution, so maybe 'distribution' would be a better name.

One way or the other, it's a very useful method to have.

Zulko commented 5 years ago

Ok, I am starting to get some clarity on this. In your case, are you after codon harmonization (based on rankings in respective organisms) or codon redistribution (having codon frequencies match)?

ValentijnBroeken commented 5 years ago

I'm mainly interested in the codon redistribution, so the method already implemented.

Zulko commented 5 years ago

Ok this is fixed on Github and PyPI. I have also changed the code so that now you have the convenient compare_frequencies method to check the final result:

from dnachisel import *

protein = "DDDKKKKKK"
sequence = reverse_translate(protein)
harmonization = CodonOptimize(species='b_subtilis', mode='harmonized')
problem = DnaOptimizationProblem(
            sequence=sequence,
            constraints=[EnforceTranslation()],
            objectives=[harmonization]
           )

print ('Sequence_before:', sequence)
problem.optimize()
print ('New sequence:', problem.sequence)

comparison = harmonization.compare_frequencies(problem.sequence, text_mode=True)
print (comparison)

Output:

Sequence_before: GACGACGACAAAAAAAAAAAAAAAAAA
New sequence: GACGATGATAAAAAGAAAAAAAAAAAG

  K: 
    total: 6
    AAA: 
      sequence: 0.67
      table: 0.7

    AAG: 
      sequence: 0.33
      table: 0.3

  D: 
    total: 3
    GAC: 
      sequence: 0.33
      table: 0.36

    GAT: 
      sequence: 0.67
      table: 0.64

Note that for this particular optimization there is a risk that DnaChisel introduces a bit of codon bias, because it solves left-to-right (which may cause spatial bias) and because reverse_translate always uses the same codons. to be clear, I am not certain this is the case. In any case I have added an option to randomize the codons when reverse-translating the protein sequence: reverse_translate(sequence, randomize_codons=True).

Let me know if that works for you!

Zulko commented 5 years ago

Regarding the naming, I am thinking of changing the terms in the future to "rank_harmonize" (match codon ranks in original host and target) and "frequence_harmonize" (the current algorithm). I'll make the necessary version bumps and warnings.

Zulko commented 5 years ago

Hey did that work for you, can I close this issue?

ValentijnBroeken commented 5 years ago

My apologies! Hadn't had the chance to run the full comparison with the new method yet. Yes, it works like a charm, thank you very much for the quick help.

I now tested the difference between the reverse_translate with or without the randomize_codons=True (which is by itself a very nice function to have). Both seem to work equally fine, without any obvious biases to me. The order in which the optimize_objective function optimizes the codons (i.e. the locations list) is namely randomized, so both work equally well. But please keep both functionalities of the reverse_translate function anyway!

I'll close the issue. Thanks again!