agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Error in Generation of Embeddings from list of sequences via ProtT5 #134

Open amalislam675 opened 8 months ago

amalislam675 commented 8 months ago

I am generating embeddings for my protein sequences with ProtT5 using the code below. I have 5000 protein sequences in total, which I provide as a list. I set the max_length parameter to 500, but I get an out-of-memory error. Can you help me fix this? I want per-protein embeddings; the final output should have shape (5000, 1024).

RuntimeError: CUDA out of memory. Tried to allocate 10.77 GiB

Code:

```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

p_sequence = list(p_sequence)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in p_sequence]

# tokenize sequences, pad/truncate up to max_length
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="max_length", truncation=True, max_length=500)

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for each sequence in the batch
emb_list = []
for i in range(len(sequence_examples)):
    emb_i = embedding_repr.last_hidden_state[i]
    emb_list.append(emb_i)

# take mean of embedding vectors for the entire protein
emb_per_protein_list = []
for emb in emb_list:
    emb_per_protein = torch.mean(emb, dim=0)
    emb_per_protein_list.append(emb_per_protein)
```

mheinzinger commented 8 months ago

Seems like you are running out of vRAM. Try generating embeddings for each protein in your set individually (from the code above it looks as if you embed all proteins simultaneously). If this does not resolve your issue, you might have to lower max_length even further (though I guess switching to single-sequence processing instead of batching already solves the issue).
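
For reference, a minimal sketch of what single-sequence processing could look like, reusing `tokenizer`, `model`, `device`, and `p_sequence` from the code above; the result-collection variable names are illustrative, not part of the original code:

```python
import re
import torch

per_protein_embs = []
for seq in p_sequence:
    # same preprocessing as before: map rare/ambiguous amino acids to X, add spaces
    seq = " ".join(list(re.sub(r"[UZOB]", "X", seq)))
    ids = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
    # drop the trailing </s> special token before mean-pooling over residues
    residue_emb = out.last_hidden_state[0, :-1]              # (L, 1024)
    per_protein_embs.append(residue_emb.mean(dim=0).cpu())   # (1024,)

emb_matrix = torch.stack(per_protein_embs)  # (num_sequences, 1024)
```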

amalislam675 commented 7 months ago

@mheinzinger, thanks, I have solved my issue. Can you please tell me how to select the maximum residue length for my protein sequences? In the ProtT5 model, there is an option to select the max_length of residues.

mheinzinger commented 7 months ago

I usually do not set the parameter at all. ProtT5 has learnt positional encoding and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger out-of-memory on my GPU; those get removed from the dataset.
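
A rough sketch of that strategy (no `max_length`, embed full-length proteins, drop any sequence that triggers out-of-memory), again assuming `tokenizer`, `model`, and `device` from the earlier snippet and a list `sequences` of already preprocessed (X-mapped, whitespace-separated) sequences; `torch.cuda.OutOfMemoryError` needs a recent PyTorch, on older versions catch `RuntimeError` instead:

```python
import torch

embeddings, skipped = [], []
for idx, seq in enumerate(sequences):
    ids = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
    try:
        with torch.no_grad():
            out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
        # mean-pool per-residue embeddings, excluding the trailing </s> token
        embeddings.append(out.last_hidden_state[0, :-1].mean(dim=0).cpu())
    except torch.cuda.OutOfMemoryError:
        skipped.append(idx)        # remove this protein from the dataset
        torch.cuda.empty_cache()   # release the partially allocated memory
```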

amalislam675 commented 6 months ago

> I usually do not set the parameter at all. ProtT5 has learnt positional encoding and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger out-of-memory on my GPU; those get removed from the dataset.

@mheinzinger, in my use case the protein sequences are not PDB chains; they were generated against proteins that belong to reviewed Swiss-Prot entries of UniProtKB. I want to do feature extraction for my protein sequences with the ProtT5 model. Can you tell me which code better fits my use case: the one in this link https://colab.research.google.com/drive/1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing , or the other one here: https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing ? Should I generate per-protein representations or per-residue representations? And if I provide a single protein sequence to ProtT5 instead of a batch, will the embeddings be the same as those produced when sequences are provided in a batch, or does batching give better results?

mheinzinger commented 6 months ago

Providing sequences as a batch or processing them as single sequences should not make a difference (except for batching being faster). Whether you want to generate per-residue or per-protein embeddings is completely up to your use case, so I cannot tell, sorry. The first notebook provides an example of how to also run a predictor on top of the embeddings; in contrast, the second notebook only has the embedding-generation part. So if you are solely interested in generating embeddings without any prediction, the second link is probably easier (but the first one should give you the same, plus more).
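
To illustrate the difference, a small sketch that derives both representations from the batched forward pass in the original snippet (assuming `embedding_repr`, `attention_mask`, and `sequence_examples` are still in scope); the per-protein vector is just the mean over the per-residue vectors, excluding padding and the special token:

```python
import torch

per_residue_embs, per_protein_embs = [], []
for i in range(len(sequence_examples)):
    # number of real residues: non-padded positions minus the trailing </s> token
    seq_len = int(attention_mask[i].sum().item()) - 1
    per_residue = embedding_repr.last_hidden_state[i, :seq_len]  # (L_i, 1024)
    per_residue_embs.append(per_residue)
    per_protein_embs.append(per_residue.mean(dim=0))             # (1024,)
```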