evolutionaryscale / esm

About Generating Protein Sequence Embeddings with Your Model #2

Closed BlenderWang9487 closed 1 week ago

BlenderWang9487 commented 1 week ago

Hi!

Thank you for your great work. I would like to ask whether your model can be used solely for generating protein sequence embeddings. For example, given a protein sequence, is there a function that produces its embedding for downstream tasks such as similarity search or property prediction with a simple linear head?

If so, do you have an example script that I can refer to? Or is there a best practice for generating such embeddings?

Thank you!

santiag0m commented 1 week ago

Sure, you can get embeddings out of ESM3!

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, SamplingConfig
from esm.utils.constants.models import ESM3_OPEN_SMALL

client = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")

# Peptidase S1A, chymotrypsin family: https://www.ebi.ac.uk/interpro/structure/PDB/1utn/
protein = ESMProtein(
    sequence=(
        "FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
        "NEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAP"
        "ILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN"
    )
)
protein_tensor = client.encode(protein)

output = client.forward_and_sample(
    protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
)
print(output.per_residue_embedding.shape)
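
For the downstream uses mentioned in the question (similarity search, a linear head), one common option is to mean-pool the per-residue embeddings into a single vector per protein. The snippet below is only a sketch of that idea, not an official recipe: it uses random NumPy arrays as stand-ins for `output.per_residue_embedding`, and the helper names `mean_pool` and `cosine_similarity` are made up for illustration. It assumes the real tensor can be converted with something like `.float().cpu().numpy()`, and it does not handle any special tokens the encoder may add at the start or end of the sequence; check that before pooling.

```python
import numpy as np


def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average a [num_residues, dim] array over residues into a [dim] vector."""
    return per_residue.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for two proteins' per-residue embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(12, 8))  # 12 residues, embedding dim 8
emb_b = rng.normal(size=(20, 8))  # 20 residues, embedding dim 8

vec_a = mean_pool(emb_a)
vec_b = mean_pool(emb_b)
similarity = cosine_similarity(vec_a, vec_b)
print(vec_a.shape, similarity)
```

Mean pooling gives a fixed-size vector regardless of sequence length, which is what a linear head or a nearest-neighbour index needs; other pooling choices may work better for a given task.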

If you have a PDB file you can also load the protein directly:

protein = ESMProtein.from_pdb("./1utn.pdb")

BlenderWang9487 commented 1 week ago

Got it! 👍

fulacse commented 4 days ago

If I understand correctly, the output shape is [num_amino_acid, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acid, dim_embed]. How can I make a batch?
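
One way to get a tensor of that shape, regardless of whether the SDK offers a batched call, is to embed each protein separately and then pad the results to a common length. The sketch below shows only the padding step, with random NumPy arrays standing in for the per-protein embeddings; `pad_embeddings` is a hypothetical helper, not part of the esm package. It also returns a boolean mask so that padded positions can be ignored later.

```python
import numpy as np


def pad_embeddings(embeddings, pad_value=0.0):
    """Stack [len_i, dim] arrays into a [batch, max_len, dim] array plus a mask.

    mask[i, j] is True where position j holds a real residue of protein i.
    """
    max_len = max(e.shape[0] for e in embeddings)
    dim = embeddings[0].shape[1]
    batch = np.full(
        (len(embeddings), max_len, dim), pad_value, dtype=embeddings[0].dtype
    )
    mask = np.zeros((len(embeddings), max_len), dtype=bool)
    for i, e in enumerate(embeddings):
        batch[i, : e.shape[0]] = e
        mask[i, : e.shape[0]] = True
    return batch, mask


# Stand-ins for three proteins of different lengths (embedding dim 8).
rng = np.random.default_rng(0)
per_protein = [rng.normal(size=(n, 8)) for n in (5, 9, 7)]

batch, mask = pad_embeddings(per_protein)
print(batch.shape)  # (3, 9, 8)

# Masked mean over residues, so padding does not dilute the average.
pooled = (batch * mask[..., None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(pooled.shape)  # (3, 8)
```

With PyTorch tensors, `torch.nn.utils.rnn.pad_sequence(..., batch_first=True)` does the same stacking; the mask still has to be built from the original lengths.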

ddofer commented 20 minutes ago

I'd suggest adding this as an example, or even putting it in the README; it's gonna be a recurring question. (Ideally, an example that runs on a large set of sequences, with the overtrained small model?) Thanks!