agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Sequence processing #123

Closed BSharmi closed 1 year ago

BSharmi commented 1 year ago

Hi there!

I have a quick question regarding sequence tokenization.

If I am tokenizing a sequence, is it necessary to convert U, Z, and O to X, as done in https://github.com/agemagician/ProtTrans/blob/master/Embedding/prott5_embedder.py#L90?

Thank you, Sharmi

mheinzinger commented 1 year ago

Hi :) No, it does not have to be done. However, those tokens are then either mapped to the unknown token or, for those of our models/tokenizers that still have them, produce embeddings that won't be very meaningful, given how rarely the model encountered them during training.
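For reference, here is a minimal sketch of that preprocessing (rare/ambiguous residues mapped to X, residues space-separated before tokenization), along the lines of the linked embedder script. It assumes the `Rostlab/prot_t5_xl_half_uniref50-enc` checkpoint; adapt names to your setup.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").eval()

sequences = ["PRTEINO", "SEQWENCEZ"]
# map rare/ambiguous amino acids (U, Z, O, B) to X and insert spaces between residues
sequences = [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in sequences]

batch = tokenizer(sequences, add_special_tokens=True, padding="longest", return_tensors="pt")
with torch.no_grad():
    embeddings = model(input_ids=batch.input_ids,
                       attention_mask=batch.attention_mask).last_hidden_state
# embeddings: (batch, padded_len, 1024); the last non-padding position per sequence is </s>
```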

BSharmi commented 1 year ago

Gotcha thank you!

One more question (since I am here!): do you have a version of ProtT5 fine-tuned for secondary structure prediction?

I don't think I can use AutoModelForTokenClassification with T5, so maybe I have to create a custom head with the encoder as the backbone?

Thank you!

mheinzinger commented 1 year ago

Nope, we have no version of ProtT5 finetuned on secondary structure. But I think you are on the right track: I would also put a custom head on top of the encoder model and finetune from there. Minor side remark: we tried this at one point and did not improve over keeping the encoder frozen and training a small CNN on top, so plain finetuning of all parameters did not seem to be the way to go. If I had to redo this now, I would probably go for something like this: https://github.com/r-three/t-few/tree/master
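In case it helps, a rough sketch of that setup (frozen encoder, small per-residue CNN head for 3-state secondary structure); the class name and hyperparameters are placeholders, not the exact architecture from the paper:

```python
import torch.nn as nn
from transformers import T5EncoderModel

class ProtT5SecStructHead(nn.Module):
    """Frozen ProtT5 encoder + small per-residue CNN classifier (sketch)."""
    def __init__(self, checkpoint="Rostlab/prot_t5_xl_half_uniref50-enc", n_classes=3):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)
        self.encoder.requires_grad_(False)  # keep the encoder frozen
        d_model = self.encoder.config.d_model  # 1024 for ProtT5-XL
        self.cnn = nn.Sequential(
            nn.Conv1d(d_model, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Conv1d(32, n_classes, kernel_size=7, padding=3),
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.cnn(hidden.transpose(1, 2)).transpose(1, 2)
        return logits  # (batch, seq_len, n_classes), one prediction per token
```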

BSharmi commented 1 year ago

Awesome, I will be using LoRA! Thanks so much!
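For anyone landing here later, a minimal LoRA sketch with the Hugging Face peft library (not something the ProtTrans repo ships; `target_modules` and ranks below are just illustrative defaults):

```python
from peft import LoraConfig, get_peft_model
from transformers import T5EncoderModel

encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q", "v"],  # attention query/value projections in the T5 blocks
)
encoder = get_peft_model(encoder, lora_cfg)
encoder.print_trainable_parameters()  # only the LoRA adapters (plus any custom head) are trained
```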


BSharmi commented 1 year ago

One last question: in the ProtTrans paper, you showed that T5 outperformed all other models on secondary structure. Was that the full encoder-decoder model, used as T5ForConditionalGeneration for token classification? I noticed a similar approach for the general T5 model in https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing, so I wanted to double-check the results in the paper.

Thank you!

mheinzinger commented 1 year ago

We only ever used the encoder-only model for predictive downstream tasks. You only need the decoder if you want to derive e.g. log-odds or actually generate sequences.
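Concretely, that distinction looks something like this (checkpoint names as on the Rostlab Hugging Face page; treat them as an assumption if they have moved):

```python
from transformers import T5EncoderModel, T5ForConditionalGeneration

# encoder-only: sufficient for embeddings and any predictive downstream head
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")

# full encoder-decoder: only needed for generation or log-odds style scoring
full_model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")
```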