FineTuning of Model for Allosteric Site Prediction

MoaazK commented 1 year ago

Hi,

I am trying to fine-tune a model for a problem similar to the one in this notebook (https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTune-SS3.ipynb). I have a small annotated dataset of sequences, transformed into the same format as the files used in that notebook. My dataset has the labels N and Y (Y marks an allosteric residue, N a non-allosteric one). However, I am getting an F1 score of 0, along with warnings such as:

.conda/envs/thesis/lib/python3.8/site-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: N seems not to be NE tag.
.conda/envs/thesis/lib/python3.8/site-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: Y seems not to be NE tag.
.conda/envs/thesis/lib/python3.8/site-packages/seqeval/metrics/v1.py:159: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.

Please find attached the screenshots of errors and metrics.

What can be done to finetune for this scenario?

NB: My dataset currently contains around 91 proteins. When I replaced N with C and Y with E (labels used in the aforementioned notebook), the F1 score rose to about 0.15. I do not understand this behavior.
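The "seems not to be NE tag" warnings suggest that seqeval is treating N and Y as named-entity tags, which it expects in an IOB-style scheme such as B-X / I-X / O. For a per-residue binary task, one option is to score residues directly instead of entities. The following is a minimal sketch of that alternative, not the notebook's metric code: the function name `residue_prf` and the choice of Y as the positive class are assumptions made for illustration.

```python
from typing import List, Tuple


def residue_prf(
    true_tags: List[List[str]],
    pred_tags: List[List[str]],
    positive: str = "Y",
) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 for a single positive label."""
    tp = fp = fn = 0
    for true_seq, pred_seq in zip(true_tags, pred_tags):
        if len(true_seq) != len(pred_seq):
            raise ValueError("true and predicted sequences differ in length")
        for t, p in zip(true_seq, pred_seq):
            if p == positive and t == positive:
                tp += 1  # predicted allosteric, truly allosteric
            elif p == positive:
                fp += 1  # predicted allosteric, truly non-allosteric
            elif t == positive:
                fn += 1  # missed allosteric residue
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up predictions (not results from the dataset in this issue).
y_true = [["N", "N", "Y", "Y", "N"]]
y_pred = [["N", "Y", "Y", "N", "N"]]
precision, recall, f1 = residue_prf(y_true, y_pred)
```

Scoring per residue avoids any dependence on tag prefixes, so the numbers do not change when the labels are renamed.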

mheinzinger commented 1 year ago

Hi,

The notebook you linked has a section "7. Tokenize labels". In this section, each token in the output space (i.e., each of your labels) is assigned an integer class. You can either rewrite this section so that it works with your labels, or adjust the load_dataset function so that it works with your data.
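As an illustration of the label-tokenization step described above, here is a minimal sketch of mapping each label to an integer class. It is not the notebook's code: `build_tag_maps` and `encode_labels` are hypothetical helper names, the layout assumes one token per residue with a single special token in front (as BERT-style tokenizers add), and the padding value -100 is chosen because it is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import Dict, List, Tuple


def build_tag_maps(label_seqs: List[List[str]]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Assign each distinct label an integer id (sorted for reproducibility)."""
    unique_tags = sorted({tag for seq in label_seqs for tag in seq})
    tag2id = {tag: i for i, tag in enumerate(unique_tags)}
    id2tag = {i: tag for tag, i in tag2id.items()}
    return tag2id, id2tag


def encode_labels(
    seq: List[str],
    tag2id: Dict[str, int],
    max_length: int,
    ignore_index: int = -100,
) -> List[int]:
    """Convert one label sequence to ids aligned with the tokenized input.

    Position 0 (leading special token), the trailing special token and all
    padding positions keep `ignore_index` so the loss skips them.
    """
    ids = [ignore_index] * max_length
    # Reserve two positions for the special tokens; truncate longer sequences.
    for pos, tag in enumerate(seq[: max_length - 2], start=1):
        ids[pos] = tag2id[tag]
    return ids


labels = ["N N Y Y N".split()]
tag2id, id2tag = build_tag_maps(labels)
encoded = encode_labels(labels[0], tag2id, max_length=8)
```

With the two labels from this issue, the mapping is N: 0 and Y: 1, and every non-residue position is masked out of the loss.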

MoaazK commented 1 year ago

Hi,

Thanks for your response. I had already made all of these changes before posting this issue. My dataset is a CSV file with two columns (the same format as in the sample notebook): "input" and "allosteric_labels".

The "input" column holds the sequence in the format -> M A D T K A K L T L N G D T A V E L D V L K
The "allosteric_labels" column holds one label per residue -> N N N N N Y Y Y Y Y N N N N Y Y N N N N N N, as can be seen in the attached photo.
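A quick sanity check on this format is to confirm that every row has exactly one label per residue and that only N and Y occur. The sketch below uses only the standard library; `load_pairs` is a hypothetical helper, not part of the notebook, and the in-memory sample reuses the example row quoted above.

```python
import csv
import io
from typing import List, TextIO, Tuple


def load_pairs(handle: TextIO) -> List[Tuple[List[str], List[str]]]:
    """Read (residues, labels) pairs and validate their alignment."""
    pairs = []
    reader = csv.DictReader(handle)
    for row_number, row in enumerate(reader, start=1):
        residues = row["input"].split()
        labels = row["allosteric_labels"].split()
        if len(residues) != len(labels):
            raise ValueError(
                f"row {row_number}: {len(residues)} residues but {len(labels)} labels"
            )
        unexpected = set(labels) - {"N", "Y"}
        if unexpected:
            raise ValueError(f"row {row_number}: unexpected labels {sorted(unexpected)}")
        pairs.append((residues, labels))
    return pairs


sample = io.StringIO(
    "input,allosteric_labels\n"
    "M A D T K A K L T L N G D T A V E L D V L K,"
    "N N N N N Y Y Y Y Y N N N N Y Y N N N N N N\n"
)
pairs = load_pairs(sample)
n_residues = len(pairs[0][0])
n_positive = pairs[0][1].count("Y")
```

Counting the Y labels per file also shows how imbalanced the classes are, which matters when interpreting F1 on a small dataset.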

Regarding your suggestion to rewrite section 7: I believe no changes are needed there, because it takes the "train_labels" variable and creates id2tag and tag2id dynamically. Please see the attached screenshot.

The following screenshot shows that the loaded model uses two labels: N: 0, Y: 1.

However, I could not figure out where the problem is. As can be seen in my previous comment, F1, precision, and recall are all zero, and the Training Loss column shows "No Log". I have also attached the notebook I am working on.

ProtBert-BFD-FineTune-SS3.zip
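One possible reason for the missing training loss, offered as an assumption rather than a diagnosis: the Hugging Face Trainer only reports a training loss once `logging_steps` optimizer steps have run, so on a small dataset an evaluation can happen before the first logged value. The arithmetic sketch below uses made-up batch settings and assumes a single device; only the figure of 91 proteins comes from this thread, and 500 is the library's default logging interval, which the notebook may override.

```python
import math


def optimizer_steps_per_epoch(
    n_examples: int,
    batch_size: int,
    grad_accum_steps: int = 1,
) -> int:
    """Number of optimizer updates in one epoch on a single device."""
    batches = math.ceil(n_examples / batch_size)
    return max(1, batches // grad_accum_steps)


n_train = 91          # from this thread (assuming all proteins are used for training)
batch_size = 8        # hypothetical value
logging_steps = 500   # transformers default; check the notebook's TrainingArguments

steps = optimizer_steps_per_epoch(n_train, batch_size)
epochs_until_first_log = math.ceil(logging_steps / steps)
```

Under these assumed settings, the first training-loss value would only be logged after dozens of epochs; lowering `logging_steps` in `TrainingArguments` would make the loss visible earlier.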

MoaazK commented 1 year ago

I forgot to attach the dataset files in my previous comment, so I am attaching them here: NotebookWithData.zip

mheinzinger commented 1 year ago

In case it's still relevant, maybe check our new finetuning scripts: https://github.com/agemagician/ProtTrans/tree/master/Fine-Tuning

FineTuning of Model for Allosteric Site Prediction #106