Closed francescopatane96 closed 1 year ago
ESM models are encoder-only, so they're not explicitly designed for sequence generation. However, I've been experimenting with this a little, and you can coax them into generating somewhat plausible sequences by providing an input "seed" sequence, replacing a fraction of its residues with mask tokens, and then asking the model to fill those masked positions with amino acid residues.
I have some data on this: in my experiments, with ESM-2 650M or 3B you can mask up to 50% of a 280-aa sequence and still get sequences that ESMFold will correctly fold into a structure.
I can share some code to help out with this if you'd like.
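A minimal sketch of the mask-and-fill recipe described above, using the HuggingFace `transformers` ESM-2 checkpoints (the checkpoint name `facebook/esm2_t33_650M_UR50D` and the helper names are assumptions for illustration, not the author's exact code):

```python
import random


def mask_sequence(seq: str, mask_fraction: float,
                  mask_token: str = "<mask>", seed: int = 0):
    """Replace a random fraction of residues with the mask token.

    Returns the token list (residues and mask tokens) and the
    sorted list of masked positions.
    """
    rng = random.Random(seed)
    n_mask = int(len(seq) * mask_fraction)
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    for p in positions:
        tokens[p] = mask_token
    return tokens, positions


def fill_masks(masked_tokens,
               model_name: str = "facebook/esm2_t33_650M_UR50D") -> str:
    """Ask an ESM-2 masked LM to fill in the <mask> tokens.

    Requires `torch` and `transformers`; the checkpoint name above
    is an assumption (any ESM-2 masked-LM checkpoint should work).
    """
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    # The ESM tokenizer splits special tokens out of the raw string,
    # so embedding "<mask>" directly in the sequence works.
    inputs = tokenizer("".join(masked_tokens), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedily replace each mask with its highest-scoring residue.
    ids = inputs["input_ids"].clone()
    mask_idx = ids == tokenizer.mask_token_id
    ids[mask_idx] = logits[mask_idx].argmax(-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True).replace(" ", "")
```

Usage would look like `tokens, _ = mask_sequence(seed_sequence, 0.5)` followed by `fill_masks(tokens)`; repeating this with different random seeds yields a set of candidate sequences you can then screen with ESMFold.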
> However I've been experimenting with this a little bit and you can coax them into generating somewhat plausible sequences by providing an input "seed" sequence, replacing a fraction of the residues with mask tokens, and then asking the model to replace those masked tokens with amino acid residues.
Indeed, see also https://elifesciences.org/articles/79854 where we did this iteratively using MSA Transformer -- one of the ideas being that with an MSA-based transformer you can be reasonably confident that your generated sequences will still belong to the same protein family.
That is indeed a good point. I've also experimented with fine-tuning ESM-2 650M on a protein family and then using that model to generate sequences, which achieves a similar aim.
Thank you very much, and thank you too, lupo. It would be very kind of you to share an example of the code.
@francescopatane96 our code (based on MSA Transformer) is here: https://github.com/Bitbol-Lab/Iterative_masking. Umberto
Thank you, Umberto. Very interesting work :)
Hi! I'm trying to apply methods similar to SMOTE, and I would like to generate sequences similar to those in my training set with ESM models. How can I do it? Thank you.
Francesco, University of Padova