agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using transformer models.
Academic Free License v3.0

I want the ProtTrans training code. #118

Closed miderxi closed 1 year ago

miderxi commented 1 year ago

Please add code that can train (fine-tune) the ProtTrans model on our own sequences. Or I can give you my sequences; please help me fine-tune on them.

Thank you

mheinzinger commented 1 year ago

Hi, unfortunately there is no single fine-tuning/training script, as these are very different models. We only ever used the publicly available training code from the NLP repositories, and that code has changed significantly since we trained our pLMs. That said, it is straightforward to adjust the Hugging Face examples so that you can train/fine-tune, for example, ProtT5 on your own protein sequences: https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling
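
For orientation, a minimal sketch of what continued span-denoising training could look like with the PyTorch transformers API (rather than the flax example script). The checkpoint name, the toy masking, and the hyperparameters are illustrative assumptions, not the original ProtTrans setup, and the snippet assumes the tokenizer exposes the usual T5 sentinel tokens.

import re
import torch
from torch.optim import AdamW
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.train()

# Protein sequences: rare amino acids mapped to X, residues space-separated
sequences = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in ["PRTEINO", "SEQWENCE"]]

# Toy corruption: mask one residue per sequence with the first sentinel token;
# a real run samples spans randomly (see run_t5_mlm_flax.py for the full recipe)
inputs, targets = [], []
for seq in sequences:
    residues = seq.split()
    inputs.append(" ".join(residues[:3] + ["<extra_id_0>"] + residues[4:]))
    targets.append("<extra_id_0> " + residues[3])

enc = tokenizer(inputs, padding="longest", return_tensors="pt")
labels = tokenizer(targets, padding="longest", return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

# One illustrative optimization step
optimizer = AdamW(model.parameters(), lr=1e-4)
loss = model(input_ids=enc.input_ids,
             attention_mask=enc.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()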

miderxi commented 1 year ago

I have received your email and will deal with it as soon as possible.

FahrulRozzy commented 1 year ago

I want to ask, can the result of feature extraction via ProtT5-XL-U50 be converted back into input_ids and attention_ids? I'm trying to do data augmentation from the results of this feature extraction, and I get the augmentation results. What I want is to convert the feature extraction back into a protein sequence. How to solve this?

miderxi commented 1 year ago

I have received your email and will deal with it as soon as possible.

mheinzinger commented 1 year ago

I want to ask, can the result of feature extraction via ProtT5-XL-U50 be converted back into input_ids and attention_ids? I'm trying to do data augmentation from the results of this feature extraction, and I get the augmentation results. What I want is to convert the feature extraction back into a protein sequence. How to solve this?

I am not 100% sure what you are trying to achieve, but I assume that you want to generate amino acid sequences from embeddings. If that's correct, you would have to feed the extracted features/embeddings back into the model using this argument: https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model.forward.encoder_outputs This assumes that you extracted embeddings only from the encoder side, modified/augmented them somehow, and now want to get a sequence back.
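
As a rough illustration of that suggestion (a sketch, not code from this repository): the encoder-only T5EncoderModel has no decoder, so getting a sequence back requires the full T5 model with an LM head. The checkpoint name and the single greedy decoding step below are placeholders.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

enc = tokenizer(["P R T E I N O"], return_tensors="pt")
with torch.no_grad():
    # Encoder embeddings of shape (batch, seq_len, hidden); this is the tensor
    # one would modify/augment before feeding it back
    encoder_hidden = model.encoder(input_ids=enc.input_ids,
                                   attention_mask=enc.attention_mask).last_hidden_state

# Feed the (augmented) encoder states back via encoder_outputs; the decoder
# needs a start token to produce logits over amino-acid tokens
decoder_input_ids = torch.full((encoder_hidden.shape[0], 1),
                               model.config.decoder_start_token_id,
                               dtype=torch.long)
with torch.no_grad():
    out = model(encoder_outputs=(encoder_hidden,),
                attention_mask=enc.attention_mask,
                decoder_input_ids=decoder_input_ids)

# Greedy choice of the first decoded token; decoding a full sequence means
# iterating this step or using model.generate (see the later answer below)
first_token = out.logits[:, -1, :].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(first_token.tolist()))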

FahrulRozzy commented 1 year ago

I want to ask, can the result of feature extraction via ProtT5-XL-U50 be converted back into input_ids and attention_ids? I'm trying to do data augmentation from the results of this feature extraction, and I get the augmentation results. What I want is to convert the feature extraction back into a protein sequence. How to solve this?

I am not 100% sure what you are trying to achieve, but I assume that you want to generate amino acid sequences from embeddings. If that's correct, you would have to feed the extracted features/embeddings back into the model using this argument: https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model.forward.encoder_outputs This assumes that you extracted embeddings only from the encoder side, modified/augmented them somehow, and now want to get a sequence back.

Can you explain more clearly? I can't find a solution and I get many errors.

Load the tokenizer

transformer_link1 = "Rostlab/prot_t5_xl_uniref50"
print("Loading: {}".format(transformer_link1))
modelT5XLuniref50 = T5EncoderModel.from_pretrained(transformer_link1)
modelT5XLuniref50 = modelT5XLuniref50.to(device)
modelT5XLuniref50 = modelT5XLuniref50.eval()
tokenizerT5XLuniref50 = T5Tokenizer.from_pretrained(transformer_link1, do_lower_case=False)

sequencess = ["PRTEINO", "SEQWENCE"]
sequencess = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequencess]

tokenize sequences and pad up to the longest sequence in the batch

ids = tokenizerT5XLuniref50.batch_encode_plus(sequencess, add_special_tokens=True, padding="longest")

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

generate embeddings

with torch.no_grad():
    embedding_rpr = modelT5XLuniref50(input_ids=input_ids, attention_mask=attention_mask)

Revert the embedding back to an amino acid sequence

outputseq = modelT5XLuniref50(encoder_outputs=embedding_rpr) #error here

I tried another way

model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")
outputseq = model(encoder_outputs=embedding_rpr) #error here

I really don't understand

mheinzinger commented 1 year ago

I am not sure which error you get exactly, but from your code above I assume that you only pass the hidden states of the last layer. Instead, I assume you should rather pass the hidden states of all encoder layers. The easiest option to do so is probably to go via return_dict=True. An example of how to do this is given here: https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.beam_search.example
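
A possible sketch of that generation route, similar in spirit to the linked beam-search example (hedged: the exact keyword arguments can differ between transformers versions, and the checkpoint name is only assumed). The encoder is run once with return_dict=True and its output is handed to generate() as encoder_outputs.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

enc = tokenizer(["P R T E I N O"], return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.get_encoder()(input_ids=enc.input_ids,
                                          attention_mask=enc.attention_mask,
                                          return_dict=True)

# Any augmentation would be applied to encoder_outputs.last_hidden_state here
# (placeholder for whatever augmentation is actually used)

with torch.no_grad():
    generated = model.generate(encoder_outputs=encoder_outputs,
                               attention_mask=enc.attention_mask,
                               num_beams=3,
                               max_length=enc.input_ids.shape[1] + 1)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))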

miderxi commented 1 year ago

I have received your email and will deal with it as soon as possible.

tsjain commented 1 year ago

If we want to tune the model for our own sequences, can you provide an example file with some sequences in the correct format?

miderxi commented 1 year ago

I have received your email and will deal with it as soon as possible.

mheinzinger commented 1 year ago

If we want to tune the model for our own sequences, can you provide an example file with some sequences in the correct format?

Hi; sorry, the best I can do is point you towards the scripts that I used for fine-tuning ProtT5 in our most recent work: https://github.com/mheinzinger/ProstT5#-training-scripts These two scripts give you everything you need to either continue pre-training (span denoising) or train for translation. I hope this helps.

tsjain commented 1 year ago

I'll simplify my request and ask whether my input sequences for tuning ProtT5 should look like the following:

K V F G R C E L A A A

D N Y R G Y S L G N W V C A

Or do I need to add other tokens to my input data?

mheinzinger commented 1 year ago

This depends heavily on the task you want to perform. If you just want to continue the original pre-training, you'll have to corrupt some tokens in your input sequence by replacing single amino acids with special sentinel tokens. So your input would look something like S E Q <extra_id_0> E N C E and your target/output would look like <extra_id_0> W </s>. We just used standard span corruption with single amino acids as tokens, so you can recycle nearly everything from the script that I referred to above: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py
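
A small sketch of that input/target format (illustrative only; it assumes the ProtT5 tokenizer exposes the usual T5 sentinel tokens, as the example above suggests). Here the masked residue of SEQWENCE is W, so the target carries W after the matching sentinel.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50",
                                        do_lower_case=False)

corrupted_input = "S E Q <extra_id_0> E N C E"  # one residue replaced by a sentinel
target = "<extra_id_0> W"                       # what the decoder should reconstruct

enc = tokenizer(corrupted_input, return_tensors="pt")
lab = tokenizer(target, return_tensors="pt").input_ids
# Inspect how the strings are tokenized (the tokenizer appends </s> itself)
print(tokenizer.convert_ids_to_tokens(enc.input_ids[0].tolist()))
print(tokenizer.convert_ids_to_tokens(lab[0].tolist()))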

mheinzinger commented 1 year ago

Update: https://github.com/agemagician/ProtTrans/issues/137#issuecomment-1817576165

miderxi commented 1 year ago

I have received your email and will deal with it as soon as possible.