Open VittorioRainaldi opened 1 month ago
This is a great suggestion, and we will work on it. Do you mind sharing the protein/DNA sequence you ran into issues with?
An option is to use non-deterministic mode to produce many good variants, then use a complexity metric based on those criteria and sort the variants on this metric. We should have a good default metric, but we should also allow users to set the parameters they want (e.g., they don't care about homopolymers so they would not penalize them).
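A minimal sketch of what such a user-configurable metric could look like, assuming the variants are plain DNA strings. All names here (`ComplexityWeights`, `complexity_score`, `rank_variants`) are hypothetical and not part of the package, and the GC bounds and homopolymer limit are placeholder values, not IDT or Twist rules.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, List, Optional


@dataclass
class ComplexityWeights:
    """Hypothetical, user-tunable penalties. Set a weight to 0 to ignore a criterion."""
    gc_weight: float = 1.0
    homopolymer_weight: float = 1.0
    gc_min: float = 0.40        # placeholder lower bound on GC fraction
    gc_max: float = 0.60        # placeholder upper bound on GC fraction
    max_homopolymer: int = 6    # placeholder longest tolerated single-base run


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def complexity_score(seq: str, weights: Optional[ComplexityWeights] = None) -> float:
    """Lower is better. Penalizes GC outside [gc_min, gc_max] and long homopolymers."""
    w = weights or ComplexityWeights()
    gc = gc_fraction(seq)
    # Distance (in percentage points) outside the allowed GC window, 0 if inside.
    gc_excess = max(w.gc_min - gc, 0.0, gc - w.gc_max) * 100
    # Number of bases by which the longest run exceeds the tolerated length.
    homo_excess = max(longest_homopolymer(seq) - w.max_homopolymer, 0)
    return w.gc_weight * gc_excess + w.homopolymer_weight * homo_excess


def rank_variants(variants: Iterable[str],
                  weights: Optional[ComplexityWeights] = None) -> List[str]:
    """Sort variants from least to most complex under the given weights."""
    return sorted(variants, key=lambda s: complexity_score(s, weights))
```

A user who does not care about homopolymers could pass `ComplexityWeights(homopolymer_weight=0)` to `rank_variants`. Repeat detection and other vendor criteria would be added as further weighted terms in the same way.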
The protein in question is called ecm (UniProt ID Q3IZ90). I am pasting the predicted DNA sequence because, as far as I understand, torch is not always reproducible across systems:
ATGACTCAGAAAGACTCACCATGGCTGTTCAGGACCTATGCGGGACACAGCACAGCCAAAGCCTCCAATGCGCTGTACCGTACCAACCTGGCGAAAGGTCAGACCGGTCTGAGCGTGGCGTTTGATCTGCCGACCCAGACCGGCTATGACAGCGATGATGCGCTGGCCCGCGGCGAAGTCGGTAAAGTCGGTGTACCGATCTGCCACCTGGGTGACATGCGTATGCTGTTTGACCAGATCCCGCTGGAACAGATGAACACCTCTATGACCATCAATGCCACAGCACCGTGGCTGCTGGCGCTGTACATTGCCGTAGCTGAAGAGCAGGGTGCGGACATCAGCAAACTGCAGGGTACTGTTCAGAATGACCTGATGAAAGAGTATCTCAGCCGTGGCACCTACATCTGCCCGCCGCGTCCATCTCTGCGCATGATCACCGATGTGGCGGCTTACACCCGTGTTCATCTGCCGAAATGGAACCCGATGAACGTCTGCTCTTACCACCTGCAGGAAGCAGGTGCGACACCGGAACAGGAACTGGCGTTTGCGCTGGCCACCGGTATTGCGGTGCTGGATGACCTGCGCACCAAAGTGCCGGCAGAACATTTCCCGGCGATGGTTGGCCGCATCAGCTTCTTCGTTAACGCCGGTATCCGCTTTGTGACCGAAATGTGCAAAATGCGTGCGTTTGTTGACCTGTGGGATGAGATCTGCCGTGACCGTTACGGTATCGAAGAAGAGAAATACCGCCGTTTCCGCTACGGTGTGCAGGTTAACAGCCTGGGCCTGACCGAACAGCAGCCGGAGAACAACGTCTACCGCATCCTGATTGAGATGCTGGCGGTGACCCTGAGCAAGAAAGCGCGTGCGCGTGCTGTTCAGCTGCCGGCGTGGAACGAAGCGCTGGGTCTGCCGCGTCCGTGGGACCAGCAGTGGAGCCTGCGTATGCAGCAGATCCTGGCCTACGAGTCCGACCTGCTGGAGTATGAAGACCTGTTTGATGGTAACCCGGCGATCGAGCGTAAAGTTGAAGCGCTGAAAGACGGTGCGCGTGAGGAGCTGGCGCACATTGAGGCGATGGGTGGTGCGATTGAAGCGATCGACTACATGAAAGCGCGTCTGGTAGAGAGCAATGCCGAGCGTATTGCCCGTGTGGAGACCGGTGAAACCGTGGTGGTCGGTGTGAACCGCTGGACCTCTGGTGCACCATCTCCGCTGACCACTGGTGACGGTGCGATTATGGTTGCTGATCCGGAAGCAGAGCGCGATCAGATTGCCCGTCTGGAAGCATGGCGTGCGGGTCGTGATGGTGCGGCGGTGGCTGCGGCGCTGGCTGAACTGCGCCGTGCGGCGACCTCCGGTGAGAACGTCATGCCGGCCTCTATTGCCGCTGCGAAAGCCGGCGCCACCACCGGTGAATGGGCGGCAGAGCTGCGCCGTGCCTTCGGTGAGTTCCGCGGCCCGACCGGTGTTGCGCGTGCGCCAAGCAACCGCACCGAAGGTCTGGATCCGATCCGTGAAGCGGTTCAGGCGGTCTCCGCGCGTCTGGGCCGTCCGCTGAAATTTGTGGTCGGTAAACCGGGTCTGGATGGCCACTCCAACGGTGCGGAACAGATTGCCGCGCGCGCGCGCGACTGCGGCATGGATATCACCTACGATGGTATCCGCCTGACGCCAGCGGAGATCGTGGCGAAAGCGGCCGATGAGCGCGCGCACGTCCTCGGTCTGTCCATTCTGTCCGGCTCCCACATGCCGCTGGTGACCGAAGTGCTGGCTGAAATGCGCCGCGCGGGTCTGGATGTTCCGCTGATCGTTGGCGGTATCATTCCGGAAGAAGATGCGGCGGAGCTGCGTGCCTCCGGTGTTGCGGCGGTTTACACCCCGAAAGATTTTGAGCTGAACCGCATTATGATGGATATTGTCGGCCTGGTTGACCGCACTGCGCTGGCGGCGGAATAA
This sequence gives the following output in the IDT gBlock analysis tool:
Denied - High Complexity (Scores of 10 or greater)
The identified complexities prevent manufacturing of this sequence.
Total Complexity Score: 18
Complexity Description Score
Aside from repeated sequences, I believe a GC content of >60% is highly unlikely for E. coli coding sequences, so I would expect that to be reflected in the model.
Ah, I forgot to mention that I optimized the sequence with the "E. coli general" setting, using the code snippet on PyPI.
@VittorioRainaldi While we work on adding user-defined rules and restrictions, could you try non-deterministic generation of multiple sequences with the new version of the package to see if it solves your problem?
I tested it with the following settings:
Then I calculated the Hamming distance of the sequences. While increasing the temperature does lead to a more diverse set of sequences, none of the ones I tested passed the IDT screening.
Here is the output I get. In each pair, the first number is the GC content (%) and the second is the Hamming distance.
temperature = 0.2
60.28586013272078 0
60.0816743236345 45
60.13272077590608 47
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
temperature = 0.5
59.928534966819804 0
60.54109239407861 128
60.0816743236345 103
59.979581419091375 100
60.0816743236345 98
60.28586013272078 99
60.18376722817764 97
60.23481368044921 98
60.23481368044921 98
60.23481368044921 98
temperature = 0.8
59.060745278203164 0
59.315977539561004 206
60.13272077590608 195
60.54109239407861 202
60.33690658499234 181
60.18376722817764 186
60.49004594180705 174
59.87748851454824 167
59.826442062276676 175
60.33690658499234 176
By the way, is there a way to calculate how much each sequence differs from the "optimum"? Perhaps an internal scoring function for the model?
P.S.: The Hamming distance is not pairwise. I used the first sequence as the reference for all of them, which is why the first value is always zero.
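For anyone who wants to reproduce this kind of table, here is a rough sketch of the two numbers, assuming all generated sequences have the same length (which holds when they encode the same protein). The function names are illustrative only and are not taken from the package or from the script used above.

```python
from typing import List, Tuple


def gc_percent(seq: str) -> float:
    """GC content as a percentage of the sequence length."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def summarize(seqs: List[str]) -> List[Tuple[float, int]]:
    """(GC %, Hamming distance to the first sequence) for each sequence.

    The first sequence is the reference, so its distance is always 0.
    """
    reference = seqs[0]
    return [(gc_percent(s), hamming(reference, s)) for s in seqs]


if __name__ == "__main__":
    # Toy input; real use would pass the generated DNA variants.
    for gc, dist in summarize(["ATGGCC", "ATGGCA", "ATGAAA"]):
        print(gc, dist)
```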
I tested the tool on a couple of protein sequences, and for one of them the predicted DNA sequence is too complex for synthesis at both IDT and Twist. Twist has the following rules to determine whether a sequence is too complex:
Output sequences could be screened for such issues and regenerated if needed.
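A sketch of that screen-and-regenerate loop, under stated assumptions: `generate` stands in for whatever call produces one non-deterministic DNA candidate, and `passes_screen` for whatever check the user wants to apply. Both are hypothetical callables supplied by the caller, and the GC check below is only a placeholder, not a vendor rule.

```python
from typing import Callable, Optional


def gc_at_most(limit: float) -> Callable[[str], bool]:
    """Build a placeholder screen: accept sequences whose GC fraction is <= limit."""
    def check(seq: str) -> bool:
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq) <= limit
    return check


def generate_until_valid(
    generate: Callable[[], str],
    passes_screen: Callable[[str], bool],
    max_attempts: int = 20,
) -> Optional[str]:
    """Call `generate` until a candidate passes the screen.

    Returns the first passing candidate, or None if none of the
    `max_attempts` candidates passed.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if passes_screen(candidate):
            return candidate
    return None
```

Returning `None` after a bounded number of attempts makes it explicit when no acceptable sequence was found, which matters for proteins where every sampled variant fails the screen.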