facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Obtaining an embedding for a sequence #348

Closed SalvatoreRa closed 1 year ago

SalvatoreRa commented 1 year ago

Thank you for sharing this fantastic work; I have started to experiment with it and it is great.

I have used the Hugging Face version to extract a representation for each protein; the idea is to use this representation in another application. Here I am using two sequences, and the model returns a 320-dimensional vector for each amino acid (I guess). Then, to obtain a single vector per sequence, I took the mean. Would you advise doing it differently? Should I use a different output, e.g. the embedding layer?

Here is the example code:

from transformers import EsmTokenizer, EsmModel
import torch

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
seqs = ["QERLKSIVRILE", "QERLKSIVRILEEEERRRRRRFFFFFRRRFFRRFRRFFRFFR"]
inputs = tokenizer(seqs, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
# Per-token representations: [batch, padded_len, hidden_dim]
last_hidden_states = outputs.last_hidden_state
x = last_hidden_states.detach()
# Average over the sequence dimension to get one vector per protein
x.mean(dim=1)

Thank you very much

tomsercu commented 1 year ago

Taking the mean is indeed what we recommend. The extract.py script in this repo also provides that option.
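
For reference, here is a minimal sketch of that pooling with the Hugging Face checkpoint from the question above. The attention-mask-weighted average (so padding positions are excluded) is an illustration added here, not necessarily what extract.py does internally:

from transformers import EsmTokenizer, EsmModel
import torch

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

seqs = ["QERLKSIVRILE", "QERLKSIVRILEEEERRRRRRFFFFFRRRFFRRFRRFFRFFR"]
inputs = tokenizer(seqs, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state         # [batch, padded_len, hidden_dim]

# Zero out padding positions so the average only covers real tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()  # [batch, padded_len, 1]
per_protein = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(per_protein.shape)                               # torch.Size([2, 320]) for this checkpoint

Note that the plain x.mean(axis=1) from the question also averages over padding positions when sequences of different lengths are batched together, which is why the mask is applied here.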

arjan-res commented 1 year ago

@SalvatoreRa and @tomsercu - With x.mean(axis=1), will this give a per-protein embedding of torch.Size([1, 1280])?
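
For what it's worth, the 1280 figure corresponds to the larger checkpoints such as esm2_t33_650M_UR50D (and ESM-1b); the esm2_t6_8M_UR50D model used above has a 320-dimensional hidden state, so the pooled tensor for the two example sequences would be torch.Size([2, 320]). A quick shape check (a sketch, assuming the same 8M checkpoint):

from transformers import EsmTokenizer, EsmModel
import torch

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

inputs = tokenizer(["QERLKSIVRILE"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.mean(dim=1).shape)   # torch.Size([1, 320]) for this checkpoint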