facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

How to use ESM models to generate new protein sequences similar to the training set #574

Closed: francescopatane96 closed this issue 1 year ago

francescopatane96 commented 1 year ago

Hi! I'm trying to apply methods similar to SMOTE, and I would like to use ESM models to generate sequences similar to those in my training set. How can I do it? Thank you!

Francesco, University of Padova

naailkhan28 commented 1 year ago

ESM models are encoder-only models, so they're not explicitly designed for sequence generation. However, I've been experimenting with this a little, and you can coax them into generating somewhat plausible sequences: provide an input "seed" sequence, replace a fraction of the residues with mask tokens, and then ask the model to fill those masked positions with amino acid residues.

I have some data on this. In my experiments, with ESM-2 650M or 3B you can mask up to 50% of a 280-residue sequence and still get sequences that ESMFold will fold correctly into a structure.

I can share some code to help out with this if you'd like
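In the meantime, here is a minimal sketch of the mask-and-fill idea using the public fair-esm API (the seed sequence, mask fraction, and greedy fill-in strategy are placeholder choices for illustration, not the exact code offered above):

```python
import random
import torch
import esm

# Load ESM-2 650M and its alphabet
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

seed = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"  # placeholder seed
mask_fraction = 0.5

# Tokenize the seed (the converter adds BOS/EOS tokens around the sequence)
_, _, tokens = batch_converter([("seed", seed)])

# Mask a random fraction of residue positions (offset by 1 to skip BOS)
positions = random.sample(range(1, len(seed) + 1), int(mask_fraction * len(seed)))
for pos in positions:
    tokens[0, pos] = alphabet.mask_idx

# One forward pass to get per-position logits over the vocabulary
with torch.no_grad():
    logits = model(tokens)["logits"]

# Fill each masked position with the most likely standard amino acid.
# Sampling from the softmax instead of taking the argmax gives more
# diverse (but noisier) sequences.
aa_idx = torch.tensor([alphabet.get_idx(a) for a in "ACDEFGHIKLMNPQRSTVWY"])
for pos in positions:
    tokens[0, pos] = aa_idx[logits[0, pos, aa_idx].argmax()]

generated = "".join(alphabet.get_tok(int(t)) for t in tokens[0, 1 : len(seed) + 1])
print(generated)
```

Generated sequences can then be screened by folding them with ESMFold and checking the predicted confidence, as described above.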

ulupo commented 1 year ago

Indeed. See also https://elifesciences.org/articles/79854, where we did this iteratively using MSA Transformer; one of the ideas being that with an MSA-based transformer you can be reasonably confident that your generated sequences will still belong to the same protein family.
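For intuition, a rough sketch of such an iterative masking loop with MSA Transformer via fair-esm (this is an approximation, not the exact procedure from the paper; the MSA and the number of rounds are placeholders, and the actual code is linked below):

```python
import random
import torch
import esm

# Load MSA Transformer and its alphabet
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Placeholder aligned sequences from one family (all the same length)
msa = [
    ("seq1", "MKT-YIAKQRQISFVK"),
    ("seq2", "MKSAYI-KQRQISFVK"),
    ("seq3", "MKTAYIAKQRQ-SFVK"),
]

# Tokens have shape (1, num_sequences, seq_len + 1); position 0 is BOS
_, _, tokens = batch_converter([msa])

# Restrict predictions to the standard amino acids plus the gap token
allowed = torch.tensor([alphabet.get_idx(t) for t in "ACDEFGHIKLMNPQRSTVWY-"])

for _ in range(5):  # number of mask/infill rounds
    # Mask ~10% of the columns across all sequences each round
    n_cols = tokens.shape[-1]
    positions = random.sample(range(1, n_cols), max(1, (n_cols - 1) // 10))
    for pos in positions:
        tokens[0, :, pos] = alphabet.mask_idx
    with torch.no_grad():
        logits = model(tokens)["logits"]
    # Greedy infill shown here; other decoding strategies (e.g. sampling)
    # are possible
    for pos in positions:
        tokens[0, :, pos] = allowed[logits[0, :, pos, allowed].argmax(-1)]

new_msa = ["".join(alphabet.get_tok(int(t)) for t in row[1:]) for row in tokens[0]]
print(new_msa)
```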

naailkhan28 commented 1 year ago

That is indeed a good point. I've also experimented with fine-tuning ESM-2 650M on a single protein family and then using the fine-tuned model to generate sequences, which achieves a similar aim; a sketch of that setup follows.
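A minimal sketch of such masked-LM fine-tuning, here via the HuggingFace transformers port of ESM-2 (an assumption for brevity; the family sequences and hyperparameters are placeholders):

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EsmForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Hypothetical sequences from one protein family; replace with your own data.
family_sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGSQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
]

# The 650M checkpoint is large; a smaller one such as
# facebook/esm2_t12_35M_UR50D is handy for testing the pipeline.
checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForMaskedLM.from_pretrained(checkpoint)

# Tokenize the family sequences
dataset = Dataset.from_dict({"sequence": family_sequences})
dataset = dataset.map(
    lambda batch: tokenizer(batch["sequence"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["sequence"],
)

# Standard BERT-style masking: 15% of tokens are masked at each step
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="esm2_family_ft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```

The fine-tuned model can then be used with the same mask-and-fill generation loop as above.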

francescopatane96 commented 1 year ago

Thank you very much, naailkhan28, and thank you too, ulupo. It would be really kind of you to share an example of the code.

ulupo commented 1 year ago

@francescopatane96 our code (based on MSA Transformer) is here: https://github.com/Bitbol-Lab/Iterative_masking. Umberto

francescopatane96 commented 1 year ago

Thank you, Umberto. Very interesting work :)