Closed BlenderWang9487 closed 1 week ago
Sure, you can get embeddings out of ESM3!
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, SamplingConfig
from esm.utils.constants.models import ESM3_OPEN_SMALL
client = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")
# Peptidase S1A, chymotrypsin family: https://www.ebi.ac.uk/interpro/structure/PDB/1utn/
protein = ESMProtein(
    sequence=(
        "FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
        "NEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAP"
        "ILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN"
    )
)
protein_tensor = client.encode(protein)
output = client.forward_and_sample(
    protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
)
print(output.per_residue_embedding.shape)
If you have a PDB file you can also load the protein directly:
protein = ESMProtein.from_pdb("./1utn.pdb")
Got it! 👍
If I understand correctly, the output shape is [num_amino_acid, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acid, dim_embed]. How do I make a batch?
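One way to get a [batch_size, num_amino_acid, dim_embed] array without relying on any batched entry point (the thread does not say whether one exists) is to run each protein through the single-protein path above and pad the results to a common length. The sketch below only shows the padding step; `pad_embeddings` is a hypothetical helper, the random arrays stand in for real per-residue embeddings, and it assumes each embedding has already been converted to a 2-D NumPy array.

```python
import numpy as np


def pad_embeddings(embeddings, pad_value=0.0):
    """Stack variable-length [length, dim] arrays into [batch, max_len, dim].

    Also returns a boolean mask of shape [batch, max_len] that is True
    at real residue positions and False at padded positions.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = embeddings[0].shape[1]
    max_len = max(e.shape[0] for e in embeddings)

    batch = np.full(
        (len(embeddings), max_len, dim), pad_value, dtype=embeddings[0].dtype
    )
    mask = np.zeros((len(embeddings), max_len), dtype=bool)
    for i, e in enumerate(embeddings):
        if e.shape[1] != dim:
            raise ValueError("all embeddings must share the same dim")
        batch[i, : e.shape[0]] = e
        mask[i, : e.shape[0]] = True
    return batch, mask


# Stand-in data: three "proteins" of different lengths, embedding dim 8.
# In practice each entry would come from one forward_and_sample call.
rng = np.random.default_rng(0)
per_protein = [rng.standard_normal((n, 8)).astype(np.float32) for n in (5, 3, 7)]

batch, mask = pad_embeddings(per_protein)
print(batch.shape)       # (3, 7, 8)
print(mask.sum(axis=1))  # [5 3 7]
```

Keep the mask alongside the batch so that downstream pooling or loss computation can ignore the padded positions.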
I'd suggest adding this as an example, or even putting it in the README, since it's going to be a recurring question. (Ideally, the example would run on a large set of sequences and use the overtrained small model?) Thanks!
Hi!
Thank you for your great work. I would like to ask whether your model can be used solely for generating protein sequence embeddings. For example, given a protein sequence, is there a function that produces its embedding for downstream tasks such as similarity search or property prediction with a simple linear head?
If so, do you have an example script that I can refer to? Or is there a best practice for generating such embeddings?
Thank you!
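For tasks like similarity search or a linear head, a single fixed-size vector per protein is usually needed rather than one vector per residue. A common choice (not something the thread itself recommends) is to mean-pool the per-residue embedding over the length axis and compare proteins by cosine similarity. The sketch below uses random arrays as stand-ins for real embeddings; `mean_pool` and `cosine_similarity` are hypothetical helper names, and it assumes any special start/end token positions have already been sliced off if they are present.

```python
import numpy as np


def mean_pool(per_residue):
    """Average a [length, dim] per-residue embedding into a [dim] vector."""
    per_residue = np.asarray(per_residue, dtype=np.float64)
    return per_residue.mean(axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in per-residue embeddings for two proteins of different lengths.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((10, 16))
emb_b = rng.standard_normal((12, 16))

vec_a = mean_pool(emb_a)
vec_b = mean_pool(emb_b)

print(vec_a.shape)  # (16,)
print(cosine_similarity(vec_a, vec_b))
```

Because pooling removes the length dimension, proteins of different lengths end up as vectors of the same size, which can be stacked directly into a [batch_size, dim_embed] matrix for a linear classifier or a nearest-neighbor index.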