Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks source link

constraints_text_summary() says AvoidPattern(RepeatedKmerPattern(2, 9)) passed, but it didn't #76

Closed Lcarey closed 3 months ago

Lcarey commented 5 months ago

Hi, I'm using DNAchisel to encode some difficult to synthesize proteins. I tried removing all kmer repeats, and constraints_text_summary() says pass, but I'm able to find a repeat. What am I not understanding about how to codon optimize?

for example, in the below code, there's a 26nt repeat that doesn't get removed.

thanks for helping me out! -Lucas

output:

repeated kmers: ('CCGGATCATATGAAACAGCATGACTT', 2) 26
[222, 936]
===> SUCCESS - all constraints evaluations pass
✔PASS ┍ EnforceTranslation[0-2856]
      │ Enforced by nucleotides restrictions
✔PASS ┍ AvoidPattern[0-2856](pattern:BsaI(GGTCTC))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:BbsI(GAAGAC))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:XhoI(CTCGAG))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:NcoI(CCATGG))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidRareCodons[0-2856]
      │ Enforced by nucleotides restrictions
✔PASS ┍ EnforceGCContent[0-2856](mini:0.38, maxi:0.62, window:100)
      │ Passed !
✔PASS ┍ AvoidPattern[0-2856](pattern:2x9mer)
      │ Passed. Pattern not found !

===> TOTAL OBJECTIVES SCORE:    -33.76
    -33.76 ┍ MatchTargetCodonUsage[0-2856](e_coli) 
           │ Codon opt. on window 0-2856 scored -3.38E+01

make sure the encoding is correct:  True

code:

from pathlib import Path 
import pandas as pd
from dnachisel import (
    DnaOptimizationProblem,
    translate,
    reverse_translate,
    CodonOptimize,
    EnforceTranslation,
    AvoidPattern,
    EnforceGCContent,
    RepeatedKmerPattern,
)

from dnachisel import *
import re
from Bio.Restriction import *
from Bio.Seq import Seq
from collections import defaultdict

#### functions to find longest repeated substring ### 
def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

#### encode 4x sfGFP ### 

ntseq = 'ATGCGTAAAGGCGAAGAGCTGTTCACTGGTGTCGTCCCTATTCTGGTGGAACTGGATGGTGATGTCAACGGTCATAAGTTTTCCGTGCGTGGCGAGGGTGAAGGTGACGCAACTAATGGTAAACTGACGCTGAAGTTCATCTGTACTACTGGTAAACTGCCGGTACCTTGGCCGACTCTGGTAACGACGCTGACTTATGGTGTTCAGTGCTTTGCTCGTTATCCGGACCATATGAAGCAGCATGACTTCTTCAAGTCCGCCATGCCGGAAGGCTATGTGCAGGAACGCACGATTTCCTTTAAGGATGACGGCACGTACAAAACGCGTGCGGAAGTGAAATTTGAAGGCGATACCCTGGTAAACCGCATTGAGCTGAAAGGCATTGACTTTAAAGAAGACGGCAATATCCTGGGCCATAAGCTGGAATACAATTTTAACAGCCACAATGTTTACATCACCGCCGATAAACAAAAAAATGGCATTAAAGCGAATTTTAAAATTCGCCACAACGTGGAGGATGGCAGCGTGCAGCTGGCTGATCACTACCAGCAAAACACTCCAATCGGTGATGGTCCTGTTCTGCTGCCAGACAATCACTATCTGAGCACGCAAAGCGTTCTGTCTAAAGATCCGAACGAGAAACGCGATCATATGGTTCTGCTGGAGTTCGTAACCGCAGCGGGCATCACGCATGGTATGGATGAACTGTACAAA'
ntseq = ntseq + ntseq + ntseq + ntseq 

constraints=[
            EnforceTranslation(),
            AvoidPattern("BsaI_site"),
            AvoidPattern("BbsI_site"),
            AvoidPattern("XhoI_site"),
            AvoidPattern("NcoI_site"),
            AvoidRareCodons(0.1, species='e_coli'),
            EnforceGCContent(mini=0.38, maxi=0.62, window=100),
            AvoidPattern(RepeatedKmerPattern(2, 9)),
            ]

problem = DnaOptimizationProblem(
    sequence=reverse_translate(protein_sequence=translate(ntseq),table='Standard'),
    constraints=constraints,
    objectives=[CodonOptimize(species='e_coli', method='match_codon_usage')],  
)

problem.max_random_iters = 2000000
problem.resolve_constraints()
problem.optimize()

new_encoding = problem.sequence

# find longest repeated substring
found_this = longestRepetitiveSubstring(new_encoding, minocc=2)
print("repeated kmers:" , found_this, len(found_this[0]))
print([match.start() for match in re.finditer(found_this[0], new_encoding)])

# how does DNAChisel think it did? 
print(problem.constraints_text_summary())
print(problem.objectives_text_summary())

print('make sure the encoding is correct: ', translate(new_encoding) == translate(ntseq))
veghp commented 5 months ago

I think the issue is that you're using RepeatedKmerPattern which is for direct repeats. Try UniquifyAllKmers.

See this example: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/4be08c2696710387ae6d5b0346e6680a69d88c64/examples/common_scenarios/circular_sequence.py#L25

Zulko commented 5 months ago

Peter is right, from your definition of longestRepetitiveSubstring it looks like you want UniquifyAllKmers(9) (note that by default it also looks for homologies on the reverse-complement).

Beware that this is one of the rare specifications in DnaChisel that can fail for complicated problems even though these problems may have indeed a solution (this is particularly true when the initial sequence already contains a lot of repeated elements). Some tips:

If UniquifyAllKmers(N) fails as a constraint (with a NoSolutionError) you could try running it as an objective: objectives=[CodonOptimize(...), UniquifyAllKmers(N, boost=100)]. You can also try shaking up the initial sequence with reverse_translate(protein_sequence, randomize_codons=True). It can make things better, or worse. Last resort, increase the k-mer length.

Lcarey commented 5 months ago

Hi Peter & Zulko,

Perfect. Thanks for the rapid response. like always with your code, I should have read the documentation

have a good evening, and thank you so much!

-Lucas