Allow user-defined rules

VittorioRainaldi commented 1 month ago

I tested the tool on a couple of protein sequences, and for one of them the predicted DNA sequence is too complex for synthesis by both IDT and Twist. Twist uses the following rules to determine whether a sequence is too complex:

- Avoid repeats of ≥ 20 bp or Tm ≥ 60 °C
- Global GC content must be between 25% and 65%
- Avoid extreme differences in GC content within a gene (i.e. the difference in GC content between the highest and lowest 50bp stretch should be no greater than 52%)
- Minimize homopolymers
- Minimize the number/length of small repeats scattered throughout the sequence
- For His tags, use a combination of CAC and CAT codons, i.e. CACCAT…

Output sequences could be screened for such issues and regenerated if needed.
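A minimal sketch of such a screen, covering only the rules above that have explicit numbers (global GC content, the 50 bp GC spread, and exact repeats of 20 bp or more). The function names are hypothetical and not part of the package. The homopolymer cutoff of 9 is an assumption, since the rules only say "minimize". The Tm-based repeat criterion and reverse-complement repeats are not checked.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_window_spread(seq: str, window: int = 50) -> float:
    """Highest minus lowest GC fraction over all windows of the given size."""
    if len(seq) < window:
        return 0.0
    values = [gc_fraction(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    return max(values) - min(values)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    longest = run = 0
    prev = ""
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        longest = max(longest, run)
    return longest


def has_exact_repeat(seq: str, k: int = 20) -> bool:
    """True if any k-mer occurs more than once (exact direct repeats only)."""
    seq = seq.upper()
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def screen(seq, *, gc_min=0.25, gc_max=0.65, max_spread=0.52,
           max_homopolymer=9, repeat_len=20):
    """Return a list of human-readable problems; an empty list means 'pass'.

    max_homopolymer is an assumed cutoff, not a published vendor rule.
    """
    problems = []
    gc = gc_fraction(seq)
    if not gc_min <= gc <= gc_max:
        problems.append(f"global GC content {gc:.1%} outside {gc_min:.0%}-{gc_max:.0%}")
    spread = gc_window_spread(seq)
    if spread > max_spread:
        problems.append(f"50 bp GC spread {spread:.1%} exceeds {max_spread:.0%}")
    run = longest_homopolymer(seq)
    if run > max_homopolymer:
        problems.append(f"homopolymer of length {run}")
    if has_exact_repeat(seq, repeat_len):
        problems.append(f"exact repeat of >= {repeat_len} bp")
    return problems
```

A generation loop could call `screen()` on each candidate and regenerate whenever the returned list is non-empty.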

Adibvafa commented 1 month ago

This is a great suggestion; we will work on it. Do you mind sharing the protein/DNA sequence you ran into the issue with?

gui11aume commented 1 month ago

An option is to use non-deterministic mode to produce many good variants, then use a complexity metric based on those criteria and sort the variants on this metric. We should have a good default metric, but we should also allow users to set the parameters they want (e.g., they don't care about homopolymers so they would not penalize them).
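A rough sketch of what that ranking could look like, assuming the variants are already available as plain DNA strings. `Weights`, `complexity`, and `rank_variants` are hypothetical names, and the individual penalties (including the allowed homopolymer length of 5 and the 9-mer repeat size) are illustrative assumptions rather than vendor scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """User-tunable weights; set a weight to 0 to ignore that criterion."""
    gc: float = 1.0
    homopolymer: float = 1.0
    repeats: float = 1.0


def gc_penalty(seq: str, low: float = 0.25, high: float = 0.65) -> float:
    """How far the global GC fraction falls outside [low, high]."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return max(low - gc, 0.0, gc - high)


def homopolymer_penalty(seq: str, allowed: int = 5) -> int:
    """Bases by which the longest single-base run exceeds `allowed`."""
    longest = run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return max(longest - allowed, 0)


def repeat_penalty(seq: str, k: int = 9) -> float:
    """Fraction of k-mers that duplicate another k-mer in the sequence."""
    n = len(seq) - k + 1
    if n <= 0:
        return 0.0
    kmers = {seq[i:i + k] for i in range(n)}
    return 1 - len(kmers) / n


def complexity(seq: str, w: Weights = Weights()) -> float:
    """Weighted sum of the penalties; lower is better. seq must be non-empty."""
    seq = seq.upper()
    return (w.gc * gc_penalty(seq)
            + w.homopolymer * homopolymer_penalty(seq)
            + w.repeats * repeat_penalty(seq))


def rank_variants(variants, w: Weights = Weights()):
    """Return the variants sorted from least to most complex."""
    return sorted(variants, key=lambda s: complexity(s, w))
```

A user who does not care about homopolymers would pass `Weights(homopolymer=0.0)`, which leaves only the GC and repeat terms in the score.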

VittorioRainaldi commented 1 month ago

The protein in question is called ecm (UniProt ID Q3IZ90). I am pasting the predicted DNA sequence because, as far as I understand, torch is not always reproducible across systems:

ATGACTCAGAAAGACTCACCATGGCTGTTCAGGACCTATGCGGGACACAGCACAGCCAAAGCCTCCAATGCGCTGTACCGTACCAACCTGGCGAAAGGTCAGACCGGTCTGAGCGTGGCGTTTGATCTGCCGACCCAGACCGGCTATGACAGCGATGATGCGCTGGCCCGCGGCGAAGTCGGTAAAGTCGGTGTACCGATCTGCCACCTGGGTGACATGCGTATGCTGTTTGACCAGATCCCGCTGGAACAGATGAACACCTCTATGACCATCAATGCCACAGCACCGTGGCTGCTGGCGCTGTACATTGCCGTAGCTGAAGAGCAGGGTGCGGACATCAGCAAACTGCAGGGTACTGTTCAGAATGACCTGATGAAAGAGTATCTCAGCCGTGGCACCTACATCTGCCCGCCGCGTCCATCTCTGCGCATGATCACCGATGTGGCGGCTTACACCCGTGTTCATCTGCCGAAATGGAACCCGATGAACGTCTGCTCTTACCACCTGCAGGAAGCAGGTGCGACACCGGAACAGGAACTGGCGTTTGCGCTGGCCACCGGTATTGCGGTGCTGGATGACCTGCGCACCAAAGTGCCGGCAGAACATTTCCCGGCGATGGTTGGCCGCATCAGCTTCTTCGTTAACGCCGGTATCCGCTTTGTGACCGAAATGTGCAAAATGCGTGCGTTTGTTGACCTGTGGGATGAGATCTGCCGTGACCGTTACGGTATCGAAGAAGAGAAATACCGCCGTTTCCGCTACGGTGTGCAGGTTAACAGCCTGGGCCTGACCGAACAGCAGCCGGAGAACAACGTCTACCGCATCCTGATTGAGATGCTGGCGGTGACCCTGAGCAAGAAAGCGCGTGCGCGTGCTGTTCAGCTGCCGGCGTGGAACGAAGCGCTGGGTCTGCCGCGTCCGTGGGACCAGCAGTGGAGCCTGCGTATGCAGCAGATCCTGGCCTACGAGTCCGACCTGCTGGAGTATGAAGACCTGTTTGATGGTAACCCGGCGATCGAGCGTAAAGTTGAAGCGCTGAAAGACGGTGCGCGTGAGGAGCTGGCGCACATTGAGGCGATGGGTGGTGCGATTGAAGCGATCGACTACATGAAAGCGCGTCTGGTAGAGAGCAATGCCGAGCGTATTGCCCGTGTGGAGACCGGTGAAACCGTGGTGGTCGGTGTGAACCGCTGGACCTCTGGTGCACCATCTCCGCTGACCACTGGTGACGGTGCGATTATGGTTGCTGATCCGGAAGCAGAGCGCGATCAGATTGCCCGTCTGGAAGCATGGCGTGCGGGTCGTGATGGTGCGGCGGTGGCTGCGGCGCTGGCTGAACTGCGCCGTGCGGCGACCTCCGGTGAGAACGTCATGCCGGCCTCTATTGCCGCTGCGAAAGCCGGCGCCACCACCGGTGAATGGGCGGCAGAGCTGCGCCGTGCCTTCGGTGAGTTCCGCGGCCCGACCGGTGTTGCGCGTGCGCCAAGCAACCGCACCGAAGGTCTGGATCCGATCCGTGAAGCGGTTCAGGCGGTCTCCGCGCGTCTGGGCCGTCCGCTGAAATTTGTGGTCGGTAAACCGGGTCTGGATGGCCACTCCAACGGTGCGGAACAGATTGCCGCGCGCGCGCGCGACTGCGGCATGGATATCACCTACGATGGTATCCGCCTGACGCCAGCGGAGATCGTGGCGAAAGCGGCCGATGAGCGCGCGCACGTCCTCGGTCTGTCCATTCTGTCCGGCTCCCACATGCCGCTGGTGACCGAAGTGCTGGCTGAAATGCGCCGCGCGGGTCTGGATGTTCCGCTGATCGTTGGCGGTATCATTCCGGAAGAAGATGCGGCGGAGCTGCGTGCCTCCGGTGTTGCGGCGGTTTACACCCCGAAAGATTTTGAGCTGAACCGCATTATGATGGATATTGTCGGCCTGGTTGACCGCACTGCGCTGGCGGCGGAATAA

This sequence gives the following output on the IDT gBlocks analysis tool:

Denied - High Complexity (Scores of 10 or greater)

The identified complexities prevent manufacturing of this sequence.

Total Complexity Score: 18

| Complexity Description | Score |
| --- | --- |
| One or more repeated sequences greater than 8 bases comprise 61.9% of the overall sequence. Solution: Redesign to reduce the repeats to be less than 40% of the sequence. | 8.8 |
| The GC content of the segment from position 1001 to position 1800 is 64.2%. Solution: Redesign to reduce the GC content below 60%. | 4.2 |
| This sequence contains a window of 100 bases starting at base 1399 with a GC content of 74%. Solution: Redesign this region to have a GC content less than 69%. | 4 |
| A hairpin with the stem sequence CAGATGAACAC exists at the following locations: 250, 458. Solution: Modify the sequence to reduce the length of the stem or complement to less than 10 bases. | 1 |

Aside from repeated sequences, I believe a GC content of >60% is highly unlikely for E. coli coding sequences, so I would expect that to be reflected in the model.

VittorioRainaldi commented 1 month ago

Ah, I forgot to mention that I optimized the sequence with the "E. coli general" setting using the code snippet on PyPI.

Adibvafa commented 1 month ago

@VittorioRainaldi While we work on adding user-defined rules and restrictions, could you try non-deterministic generation of multiple sequences using the new version of the package to see if it solves your problem?

VittorioRainaldi commented 1 month ago

I tested it with the following settings:

temperature = 0.2, 0.5, and 0.8
top_p = 0.95
num_sequences = 10

Then I calculated the Hamming distance between the sequences. While increasing the temperature does lead to a more diverse set of sequences, none of the ones I tested passed the IDT screening.

Here is the output I get. The first number is the GC content (%), and the second is the Hamming distance.

temperature = 0.2

60.28586013272078 0
60.0816743236345 45
60.13272077590608 47
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48

temperature = 0.5

59.928534966819804 0
60.54109239407861 128
60.0816743236345 103
59.979581419091375 100
60.0816743236345 98
60.28586013272078 99
60.18376722817764 97
60.23481368044921 98
60.23481368044921 98
60.23481368044921 98

temperature = 0.8

59.060745278203164 0
59.315977539561004 206
60.13272077590608 195
60.54109239407861 202
60.33690658499234 181
60.18376722817764 186
60.49004594180705 174
59.87748851454824 167
59.826442062276676 175
60.33690658499234 176

By the way, is there a way to calculate how much each sequence differs from the "optimum"? Perhaps an internal scoring function for the model?

P.S.: The Hamming distance is not pairwise; I used the first sequence as the reference for all of them, which is why the first value is always zero.
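For reference, a minimal sketch of how GC content and Hamming distance against the first sequence can be computed. This is an illustration with hypothetical function names, not necessarily the script used for the numbers above.

```python
def gc_percent(seq: str) -> float:
    """GC content of seq as a percentage (seq must be non-empty)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def compare_to_first(seqs):
    """(GC %, Hamming distance to seqs[0]) for every sequence in seqs."""
    reference = seqs[0]
    return [(gc_percent(s), hamming(reference, s)) for s in seqs]
```

Because every variant encodes the same protein, all sequences have the same length, so the equal-length requirement of the Hamming distance is met.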

Adibvafa / CodonTransformer

Allow user-defined rules #7