Open amalislam675 opened 8 months ago
Seems like you are running out of vRAM. Try generating embeddings for each protein in your set individually (from your code, it looks as if you are embedding all proteins simultaneously). If this does not resolve your issue, you might have to lower max_length even further (though I guess switching from batching to single-sequence processing already solves the issue).
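A minimal sketch of the single-sequence approach suggested here, assuming `tokenizer`, `model` and `device` are created as in the posted code. The helper names `prepare_sequence` and `embed_protein` are hypothetical, and no `max_length` is set. It relies on the T5 tokenizer appending one end-of-sequence token, which is sliced off before averaging.

```python
import re


def prepare_sequence(sequence):
    """Map rare/ambiguous residues (U, Z, O, B) to X and space-separate the residues."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))


def embed_protein(sequence, tokenizer, model, device):
    """Return one per-protein vector (mean over residues) for a single sequence."""
    import torch  # imported here so prepare_sequence stays usable without PyTorch

    enc = tokenizer(prepare_sequence(sequence), add_special_tokens=True, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    # Keep only the residue positions; the trailing end-of-sequence token is dropped.
    residues = out.last_hidden_state[0, : len(sequence)]
    return residues.mean(dim=0).cpu()
```

Stacking the results, for example `torch.stack([embed_protein(s, tokenizer, model, device) for s in p_sequence])`, should then give one row per protein, so only one sequence occupies GPU memory at a time.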
@mheinzinger, thanks, I have solved my issue. Can you please tell me how to choose the maximum number of residues for our protein sequences? The ProtT5 model has an option to set max_length.
I usually do not set the parameter at all. ProtT5 has learnt positional encoding and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger out-of-memory on my GPU; those get removed from the dataset.
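The "embed until out-of-memory, then drop" strategy could look roughly like the sketch below. `embed_skipping_oom`, `embed_fn` and `on_oom` are hypothetical names. `embed_fn` stands for any function that embeds one sequence, and the code assumes that a CUDA out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", as in the error quoted in this thread.

```python
def embed_skipping_oom(sequences, embed_fn, on_oom=None):
    """Embed sequences one at a time; collect the indices of those that run out of memory."""
    embeddings, skipped = {}, []
    # Shortest first, so failures cluster at the long end of the set.
    for idx in sorted(range(len(sequences)), key=lambda i: len(sequences[i])):
        try:
            embeddings[idx] = embed_fn(sequences[idx])
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not hide it
            skipped.append(idx)
            if on_oom is not None:
                on_oom()  # e.g. torch.cuda.empty_cache
    return embeddings, skipped
```

The returned `skipped` list makes it explicit which proteins were removed from the dataset, so they can be reported or handled separately.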
@mheinzinger, in my use case the protein sequences are not PDB chains; they are generated against proteins that belong to reviewed (Swiss-Prot) UniProtKB entries. I want to do feature extraction for my protein sequences with the ProtT5 model. Can you tell me which code fits my use case better: the one at https://colab.research.google.com/drive/1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing , or the one at https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing ? Should I generate per-protein or per-residue representations? If I pass a single protein sequence to ProtT5 instead of a batch, will the resulting embedding be the same as the one produced when sequences are provided in a batch? Or does providing sequences in a batch give better results?
Providing sequences as a batch or processing them as single sequences should not make a difference (except for batching being faster). Whether you want to generate per-residue or per-protein embeddings is completely up to your use case, so I cannot tell, sorry. This notebook provides an example of how to also run a predictor on top of embeddings. In contrast, this second notebook only has the embedding generation part. So if you are solely interested in generating embeddings without any prediction, the second link is probably easier (but the first one should give you the same plus more).
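One caveat when turning batched per-residue embeddings into per-protein vectors: a plain mean over the sequence axis also averages the padding positions, so the pooled result then depends on how much padding the batch added. A minimal sketch of mask-aware mean pooling follows; the function name `masked_mean` is hypothetical, and the inputs are assumed to be the model's `last_hidden_state` and the tokenizer's `attention_mask`.

```python
import torch


def masked_mean(last_hidden_state, attention_mask):
    """Average token embeddings per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, length, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)                            # avoid division by zero
    return summed / counts
```

With ProtT5 this would map a `(batch, length, 1024)` tensor to `(batch, 1024)`. Note that the attention mask still covers the end-of-sequence token; to average residues only, that position would have to be zeroed in the mask as well.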
I am generating embeddings for my protein sequences with ProtT5 using the following code. I have 5000 protein sequences in total, which I provide as a list. I fix the max_length parameter to 500, but I get an out-of-memory error. Can you help me fix it? I have to generate per-protein embeddings. The final output I want has shape (5000, 1024).
RuntimeError: CUDA out of memory. Tried to allocate 10.77 GiB
Code:

```python
p_sequence = list(p_sequence)

from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.full() if device=='cpu' else model.half()

# prepare the protein sequences as a list
p_sequence = p_sequence

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in p_sequence]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="max_length", truncation=True, max_length=500)
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for each sequence in the batch and removed padded and special tokens
emb_list = []
for i in range(len(sequence_examples)):
    emb_i = embedding_repr.last_hidden_state[i]
    emb_list.append(emb_i)

# take mean of embedding vectors for the entire protein
emb_per_protein_list = []
for emb in emb_list:
    emb_per_protein = torch.mean(emb, dim=0)
    emb_per_protein_list.append(emb_per_protein)
```