Closed · miderxi closed this issue 1 year ago
Hi, unfortunately there is no single fine-tuning/training script, as these are very different models. We only ever used the publicly available training code from the NLP repositories, and that code has changed significantly since we trained our pLMs. That said, it is really easy to adjust the code from the Hugging Face examples so that you can train/fine-tune, for example, ProtT5 on your own protein sequences: https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling
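To make that concrete, here is a minimal data-preparation sketch (my own addition, not code from this repo): ProtT5 expects upper-case, space-separated amino acids with the rare residues U/Z/O/B mapped to X, and, as far as I can tell, the linked Hugging Face example can be pointed at a plain text file with one such sequence per line. The file name and example sequences below are placeholders.

```python
import re

# Placeholder sequences; replace with your own proteins.
raw_sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "PRTEINO"]

with open("train_sequences.txt", "w") as handle:
    for seq in raw_sequences:
        seq = re.sub(r"[UZOB]", "X", seq.upper())  # map rare/ambiguous residues to X
        handle.write(" ".join(seq) + "\n")         # ProtT5 tokenizes single, space-separated amino acids
```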
I want to ask: can the result of feature extraction via ProtT5-XL-U50 be converted back into input_ids and attention_ids? I'm trying to do data augmentation on the extracted features, and I already have the augmented results. What I want is to convert those features back into a protein sequence. How can I solve this?
I am not 100% sure what you are trying to achieve, but I assume that you want to generate amino acid sequences from embeddings. If that's correct, you would have to feed the extracted features/embeddings back into the model using this parameter: https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model.forward.encoder_outputs This assumes that you extracted embeddings only from the encoder side, modified/augmented them somehow, and now want to get a sequence back.
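To illustrate what the linked encoder_outputs argument expects, here is a minimal sketch of my own (an assumption, not the authors' code), using the full seq2seq checkpoint via T5ForConditionalGeneration: extract the encoder hidden states, optionally modify them, wrap them in a BaseModelOutput, and feed them back together with decoder_input_ids. A plain forward pass like this returns logits, not a finished sequence; turning embeddings into a sequence via generation is sketched further below.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

ckpt = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(ckpt, do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained(ckpt).eval()

# ProtT5 expects single, space-separated amino acids
ids = tokenizer(" ".join("SEQWENCE"), return_tensors="pt")

with torch.no_grad():
    enc = model.get_encoder()(input_ids=ids.input_ids,
                              attention_mask=ids.attention_mask)

augmented = enc.last_hidden_state            # <-- apply your augmentation here

# Feed the (augmented) embeddings back in; T5 uses the pad token as decoder start token.
decoder_start = torch.tensor([[tokenizer.pad_token_id]])
with torch.no_grad():
    out = model(encoder_outputs=BaseModelOutput(last_hidden_state=augmented),
                attention_mask=ids.attention_mask,
                decoder_input_ids=decoder_start)
print(out.logits.shape)                      # (batch, decoder_len, vocab_size)
```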
Can you explain more clearly? I can't find a solution and I keep getting errors.
```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer, T5Model

device = "cuda" if torch.cuda.is_available() else "cpu"

transformer_link1 = "Rostlab/prot_t5_xl_uniref50"
print("Loading: {}".format(transformer_link1))
modelT5XLuniref50 = T5EncoderModel.from_pretrained(transformer_link1)
modelT5XLuniref50 = modelT5XLuniref50.to(device)
modelT5XLuniref50 = modelT5XLuniref50.eval()
tokenizerT5XLuniref50 = T5Tokenizer.from_pretrained(transformer_link1, do_lower_case=False)

sequencess = ["PRTEINO", "SEQWENCE"]
sequencess = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequencess]

ids = tokenizerT5XLuniref50.batch_encode_plus(sequencess, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

with torch.no_grad():
    embedding_rpr = modelT5XLuniref50(input_ids=input_ids, attention_mask=attention_mask)

outputseq = modelT5XLuniref50(encoder_outputs=embedding_rpr)  # error here

model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")
outputseq = model(encoder_outputs=embedding_rpr)  # error here
```
I really don't understand
I am not sure what error you get exactly, but from your code above I assume that you only pass the hidden states of the last layer. Instead, I assume you should rather pass the hidden states of all encoder layers. The easiest option to do so is probably to go via return_dict=True. An example of how to do this is given here: https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.beam_search.example
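For illustration, here is a hedged sketch of how I would apply this to the snippet above (my own guess at a fix, not the authors' code): a T5EncoderModel has no decoder, so load the full seq2seq checkpoint, wrap the encoder output in a BaseModelOutput so that generate() accepts it, and let beam search produce token ids that can be decoded back into a (space-separated) amino acid sequence. The variables embedding_rpr, attention_mask, input_ids, tokenizerT5XLuniref50, and device are reused from the snippet above.

```python
import torch
from transformers import T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

model_full = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50").to(device).eval()

# Wrap the (possibly augmented) encoder hidden states; generate() then skips re-encoding.
enc_out = BaseModelOutput(last_hidden_state=embedding_rpr.last_hidden_state)

with torch.no_grad():
    generated = model_full.generate(encoder_outputs=enc_out,
                                    attention_mask=attention_mask,
                                    num_beams=3,                    # beam search, as in the linked example
                                    max_length=input_ids.shape[1])

print(tokenizerT5XLuniref50.batch_decode(generated, skip_special_tokens=True))
```

Keep in mind that ProtT5's decoder was trained for span denoising rather than full-sequence reconstruction, so the decoded output will not necessarily reproduce the input sequence exactly.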
If we want to tune the model for our own sequences, can you provide an example file with some sequences in the correct format?
Hi, sorry, the best I can provide is to point you towards the scripts that I used for fine-tuning ProtT5 for our most recent work: https://github.com/mheinzinger/ProstT5#-training-scripts These two scripts give you everything you need to either continue pre-training (span denoising) or train on translation. I hope this helps.
K V F G R C E L A A A
This depends heavily on the task you want to perform. If you just want to continue the original pre-training, you'll have to corrupt some tokens in your input sequence by replacing single amino acids with special sentinel tokens. So for the sequence SEQWENCE your input would look something like S E Q <extra_id_0> E N C E and your target/output should look like this: <extra_id_0> W </s>
We just used standard span corruption with single amino acids as tokens, so you can recycle nearly everything from the script that I referred to above: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py
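As a hand-rolled illustration of this objective, here is a sketch under my own assumptions (not the authors' training code): the linked run_t5_mlm_flax.py builds such input/target pairs automatically and at scale, so treat this only as a picture of what a single training example and its loss look like.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

ckpt = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(ckpt, do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained(ckpt)
model.train()

corrupted = "S E Q <extra_id_0> E N C E"   # input: one amino acid replaced by a sentinel
target    = "<extra_id_0> W"               # target: the masked amino acid; the tokenizer appends </s>

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
outputs.loss.backward()                    # in a real loop: optimizer.step(), zero_grad(), next batch
```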
Please add code that can train (fine-tune) the ProtTrans models using your own sequences. Or I can give you my sequences; please help me fine-tune on them.
Thank you