agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

FineTuning of Model for Allosteric Site Prediction #106

Closed: MoaazK closed this issue 1 year ago

MoaazK commented 1 year ago

Hi,

I am trying to fine-tune a model for a problem related to one of the notebooks (https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTune-SS3.ipynb). I have a small annotated dataset of sequences (transformed into the same format as the files used in the aforementioned notebook). My dataset has the labels N and Y (Y marking an allosteric residue/amino acid, N a non-allosteric one). However, I am getting an F1 score of 0 and errors such as the ones shown below:

Please find attached the screenshots of errors and metrics.

What can be done to finetune for this scenario?

NB: My dataset currently contains around 91 proteins. When I replaced N with C and Y with E (the labels used in the aforementioned notebook), it started to give me an F1 score of around 0.15. I could not understand this behavior.

[screenshots: errors and metrics]
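(Editorial note: allosteric residues are typically a small minority of a protein's residues, so a model that collapses to predicting "N" everywhere would score an F1 of 0 for the "Y" class. A quick class-balance check, sketched below under the assumption that `train_labels` holds per-residue tag sequences, can confirm whether this is happening; the variable name is an assumption, not taken from the notebook.)

```python
from collections import Counter

# Illustrative sanity check, assuming train_labels is a list of per-residue
# tag sequences such as ["N", "N", "Y", ...].
label_counts = Counter(tag for seq in train_labels for tag in seq)
print(label_counts)  # a very small "Y" count would explain an F1 near 0
```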

mheinzinger commented 1 year ago

Hi,

the notebook you linked has a section "7. Tokenize labels". In this section, each token in the output space (the labels you have) is assigned an integer class. You can either rewrite this section so that it works with your labels, or adjust the load_dataset function so that it works with your data.
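(Editorial note: a minimal sketch of what such a label-tokenization step typically looks like in HuggingFace-style token classification; the variable names mirror the notebook, but the code here is illustrative rather than the notebook's own.)

```python
# Build the tag/id mappings dynamically from the training labels.
unique_tags = sorted({tag for seq in train_labels for tag in seq})
tag2id = {tag: i for i, tag in enumerate(unique_tags)}
id2tag = {i: tag for tag, i in tag2id.items()}

def encode_tags(tags, max_length):
    """Map residue tags to integer ids; pad with -100 so the loss ignores padding."""
    ids = [tag2id[tag] for tag in tags][:max_length]
    return ids + [-100] * (max_length - len(ids))
```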

MoaazK commented 1 year ago

Hi,

Thanks for your response. I had already made all of these changes before posting this issue. My dataset has two columns in the CSV file (the same format as used in the sample notebook): "input" and "allosteric_labels".
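(Editorial note: a hypothetical loader for a CSV in that format might look like the sketch below; the column names come from the comment above, while the space-separated encoding is an assumption, not necessarily the notebook's exact format.)

```python
import pandas as pd

def load_dataset(path):
    # Assumes "input" holds space-separated residues and "allosteric_labels"
    # holds space-separated N/Y tags of matching length.
    df = pd.read_csv(path)
    seqs = [s.split() for s in df["input"]]
    labels = [l.split() for l in df["allosteric_labels"]]
    assert all(len(s) == len(l) for s, l in zip(seqs, labels)), "length mismatch"
    return seqs, labels
```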

As for rewriting section 7: I believe I don't have to make any changes there, since it takes the train_labels variable and creates id2tag and tag2id dynamically. Please see the attached screenshot. [screenshot: section 7]

The following screenshot shows that, when the model was loaded, it used two labels: N: 0, Y: 1. [screenshot: model label config]
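(Editorial note: the label mapping shown in the screenshot corresponds to loading the model with an explicit two-class configuration, roughly as in this sketch; the checkpoint name follows the notebook, the rest is illustrative.)

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained(
    "Rostlab/prot_bert_bfd",
    num_labels=2,
    id2label={0: "N", 1: "Y"},
    label2id={"N": 0, "Y": 1},
)
```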

However, I could not figure out where the problem is. As can be seen in my previous comment, F1, precision, and recall are all zero, and even the training loss shows a "No Log" value. I have also attached the notebook I am working on.

ProtBert-BFD-FineTune-SS3.zip
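(Editorial note: "No Log" in the Trainer's progress table usually just means no logging step has occurred yet, e.g. logging_steps exceeds the number of optimization steps per epoch, so it is not itself an error. For the zero scores, one thing worth checking is that the metrics ignore padded positions and score the positive class explicitly; a hedged sketch, assuming -100 padding and "Y" mapped to id 1.)

```python
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(pred):
    # Flatten per-residue predictions and drop padded positions (-100),
    # then score the rare "Y" class (id 1) explicitly.
    labels = pred.label_ids.flatten()
    preds = pred.predictions.argmax(-1).flatten()
    mask = labels != -100
    p, r, f1, _ = precision_recall_fscore_support(
        labels[mask], preds[mask], average="binary", pos_label=1, zero_division=0
    )
    return {"precision": p, "recall": r, "f1": f1}
```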

MoaazK commented 1 year ago

I forgot to attach the dataset files in my previous comment; I have attached them here. NotebookWithData.zip

mheinzinger commented 1 year ago

In case it's still relevant, maybe check our new finetuning scripts: https://github.com/agemagician/ProtTrans/tree/master/Fine-Tuning