Closed by MoaazK 1 year ago
Hi,
The notebook you linked has a section called "7. Tokenize labels". In that section, each token in the output space (i.e. each of your labels) is assigned an integer class. You can either rewrite this section so that it works with your labels, or adjust the load_dataset function so that it works with your data.
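The label-to-integer mapping described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper name `build_tag_maps` and the toy `train_labels` are assumptions, and the labels are assumed to be a list of per-residue tag lists.

```python
from typing import Dict, List, Tuple


def build_tag_maps(train_labels: List[List[str]]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build tag->id and id->tag mappings from per-residue label lists.

    Hypothetical helper for illustration; sorting makes the ids deterministic.
    """
    # Collect every distinct tag that occurs in the training labels.
    unique_tags = sorted({tag for seq in train_labels for tag in seq})
    tag2id = {tag: idx for idx, tag in enumerate(unique_tags)}
    id2tag = {idx: tag for tag, idx in tag2id.items()}
    return tag2id, id2tag


# Toy labels (made up): one list of N/Y tags per protein sequence.
train_labels = [["N", "N", "Y", "N"], ["Y", "N"]]
tag2id, id2tag = build_tag_maps(train_labels)
print(tag2id)
print(id2tag)
```

Because the mapping is derived from whatever tags appear in the data, a two-label dataset should need no changes at this step as long as the labels are loaded as lists of single-character tags.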
Hi,
Thanks for your response. I had already made all of these changes before posting this issue. My dataset is a CSV file with two columns, "input" and "allosteric_labels", in the same format as the one used in the sample notebook.
Regarding rewriting section 7: I believe I don't need to change it, because it takes the "train_labels" variable and builds id2tag and tag2id dynamically. Please see the attached screenshot.
The following screenshot shows that the loaded model uses two labels: N: 0 and Y: 1.
However, I could not figure out where the problem is. As shown in my last comment, F1, Precision, and Recall are all zero, and even Training Loss shows "No Log". I have also attached the notebook I am working on.
I forgot to attach the dataset files to my previous comment, so I have attached them here: NotebookWithData.zip
In case it's still relevant, maybe check our new finetuning scripts: https://github.com/agemagician/ProtTrans/tree/master/Fine-Tuning
Hi,
I am trying to fine-tune a model for a problem similar to the one in this notebook (https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTune-SS3.ipynb). I have a small annotated dataset of sequences, which I transformed into the same format as the files used in that notebook. My dataset has the labels N and Y (Y marks an allosteric residue/amino acid, N a non-allosteric one). However, I am getting an F1 score of 0, along with warnings such as:
.conda/envs/thesis/lib/python3.8/site-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: N seems not to be NE tag.
.conda/envs/thesis/lib/python3.8/site-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: Y seems not to be NE tag.
.conda/envs/thesis/lib/python3.8/site-packages/seqeval/metrics/v1.py:159: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
Please find attached the screenshots of errors and metrics.
What can be done to finetune for this scenario?
NB: My dataset currently contains around 91 proteins. When I replaced N with C and Y with E (labels used in the aforementioned notebook), I started getting an F1 score of around 0.15. I could not understand this behavior.
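The "seems not to be NE tag" warnings suggest that seqeval is trying to read the labels as entity-style tags (it is designed for chunking schemes such as IOB2 or IOBES), which plain per-residue labels like N and Y are not. One possible workaround, offered only as a sketch and not as the notebook's method, is to compute per-residue precision, recall, and F1 directly. The function name `token_level_prf` and the toy data below are assumptions made for illustration.

```python
from typing import List, Tuple


def token_level_prf(
    y_true: List[List[str]],
    y_pred: List[List[str]],
    positive: str = "Y",
) -> Tuple[float, float, float]:
    """Per-residue precision, recall and F1 for one positive label.

    Hypothetical helper: each residue counts as its own sample, so no
    entity/chunk tagging scheme is assumed.
    """
    tp = fp = fn = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        if len(true_seq) != len(pred_seq):
            raise ValueError("true and predicted sequences differ in length")
        for t, p in zip(true_seq, pred_seq):
            if p == positive and t == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif t == positive:
                fn += 1
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example (made-up labels, not taken from the attached dataset).
y_true = [["N", "Y", "Y", "N"], ["Y", "N"]]
y_pred = [["N", "Y", "N", "N"], ["Y", "Y"]]
print(token_level_prf(y_true, y_pred))
```

In a setup like the notebook's, a function of this kind would be called on the decoded label sequences after padding and special-token positions have been filtered out. Whether per-residue scoring is the right metric for allosteric-site prediction is a separate judgment call.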