agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Long Protein sequences embeddings #70

Closed rominaappierdo closed 2 years ago

rominaappierdo commented 2 years ago

Hello and thank you for sharing your work!

I'm trying to obtain protein embeddings with ProtBert, and I used this code to get the representations:

from transformers import BertModel, BertTokenizer
import re

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

# Sequences are given as space-separated residues; rare amino acids are mapped to X
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)

encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
# output[1] is the pooled output; take the vector for the single input sequence
embedding_vector = output[1][0]

Since the sequences in my dataset have different lengths, I would like to know whether there is any limit on the sequence length your model can represent (e.g. 512, as is often the case)?

mheinzinger commented 2 years ago

Hi, in general I would recommend using ProtT5 rather than ProtBERT, as ProtT5 gave better performance in all our benchmarks so far. Especially for longer sequences, ProtT5 has the advantage of a learned positional encoding, which should generalize better to sequences longer than those seen during training (at least, that was shown for NLP, if I'm not mistaken).

That being said: our current limit on sequence length is usually the available vRAM, so our models CAN handle sequences longer than the ones they were trained on (as you said: ProtT5 was trained on proteins of length 512). However, you might want to monitor the effect on performance for those explicitly, e.g. create a subset of your test set holding only proteins longer than 512 residues and benchmark it separately, so that you can confirm there is no length effect. If you want to stick with ProtBERT, the same applies: you can embed sequences longer than the ones the model was trained on (in the case of ProtBERT, the longest sequences during training were L=1024).
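For illustration, a minimal sketch of the length-based split described above; the cutoff value and the (id, sequence) pairs are placeholders:

# Hypothetical example: split a test set so proteins longer than the training
# length (512 for ProtT5) can be benchmarked separately for length effects.
test_set = [("P12345", "MKTAYIAKQR"), ("Q67890", "MASNTVSAQG")]  # placeholder (id, sequence) pairs
cutoff = 512

within_training_length = [(pid, seq) for pid, seq in test_set if len(seq) <= cutoff]
longer_than_training = [(pid, seq) for pid, seq in test_set if len(seq) > cutoff]
# Benchmark the two subsets separately to check for a length effect.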

Hope this helped, best, Michael

rominaappierdo commented 2 years ago

Sorry to bother you again; following your advice, I'm trying to get protein embeddings using ProtT5 instead of ProtBERT. I'm using this code:

from transformers import T5Tokenizer, T5Model
import re
import torch

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")

sequences_Example = ["A E T C Z A O", "S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

with torch.no_grad():
    embedding = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=None)

# For feature extraction we recommend using the encoder embedding
encoder_embedding = embedding[2].cpu().numpy()
decoder_embedding = embedding[0].cpu().numpy()

However, I'm getting this error:

"You have to specify either decoder_input_ids or decoder_inputs_embeds"

I tried to solve it, but I have no clue how to get decoder_input_ids or decoder_inputs_embeds. Could you help?

I also want to take this opportunity to thank you for your previous reply; it helped a lot.

mheinzinger commented 2 years ago

I think it might be better to load the T5EncoderModel right away, thereby dropping the decoder side, which you won't use, from the very beginning. An example is given here.
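For reference, a minimal sketch of what the encoder-only route could look like, adapted from the snippet above; the mean-pooling at the end is just one way to obtain a per-protein embedding:

from transformers import T5Tokenizer, T5EncoderModel
import re
import torch

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd")
model.eval()

sequences_Example = ["A E T C Z A O", "S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

with torch.no_grad():
    # Encoder-only forward pass: no decoder_input_ids are needed
    embedding = model(input_ids=input_ids, attention_mask=attention_mask)

# Per-residue embeddings of shape (batch, max_len, 1024); padding and the trailing
# special token should be masked out before any pooling
residue_embeddings = embedding.last_hidden_state

# Example: mean-pool over the 7 residues of the first sequence to get a per-protein vector
per_protein_embedding = residue_embeddings[0, :7].mean(dim=0)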

rominaappierdo commented 2 years ago

Thank you so much