Open · choomegan opened this issue 3 months ago
Hi @choomegan,
could you check whether the best model was saved in a previous epoch? I hypothesize that a best model with a non-zero F1-score was found in an earlier epoch, which would explain why you still achieve a non-zero F1-score on the test set. Could you post the full log output here? :)
Hi @stefan-it, I have attached the full logs here: flair_finetune.log

I ran inference with the `final-model.pt` model, and the results (0.8490) match the test results seen at the end of the training log file.

It seems the best model is not saved: only `['training.log', 'final-model.pt', 'test.tsv', 'dev.tsv', 'loss.tsv']` is found in `/flair-output`, which is the `base_path` I specified in the `trainer.fine_tune` method.
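For reference, the inference step mentioned above can be sketched as follows. This is only a sketch, assuming the standard `SequenceTagger.load`/`predict` API; the path matches the `base_path` from this thread, and the sentence text is a placeholder:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the final (not the best) checkpoint written to base_path
tagger = SequenceTagger.load("flair-output/final-model.pt")

sentence = Sentence("An example sentence to tag.")
tagger.predict(sentence)
print(sentence.to_tagged_string())
```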
Hello @choomegan. Have you found a solution to the issue yet? I have the same problem: the validation F1 is ~0 in `loss.tsv`, while the test F1 takes good values (0.76, 0.8, ...).
Hi @Aakame, I have not found a solution to the issue yet :( @stefan-it would you be able to assist? Thanks!
I've downgraded Flair to version 0.12.2, and it appears that `loss.tsv` now produces the correct values for DEV_F1. It seems that the bug may be present only in the most recent version.
The only difference between the faulty DEV evaluations that happen after each epoch and the correct final TEST evaluation is the storage of the embeddings, which doesn't happen in the latter case:

`store_embeddings(evaluation_split_data, embeddings_storage_mode)`
I found out that when I set `embeddings_storage_mode` to `"none"`, the DEV evaluation happens correctly again and the score becomes higher than zero.
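For anyone hitting this, the workaround can be passed directly when fine-tuning. This is only a configuration sketch: the path and epoch count are taken from this thread, and it assumes `ModelTrainer.fine_tune` forwards `embeddings_storage_mode` to the training loop as in recent Flair releases:

```python
from flair.trainers import ModelTrainer

# tagger and corpus are assumed to be set up as in the bug report
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "flair-output",                  # base_path from this thread
    max_epochs=150,
    embeddings_storage_mode="none",  # workaround: skip embedding storage
)
```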
@stefan-it I guess the gold labels get wiped out as part of `data_point.clear_embeddings()`?
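To make the hypothesis above concrete, here is a toy, self-contained illustration (NOT Flair's actual code; all class and function names are invented for the sketch) of the suspected failure mode: if the embedding-storage cleanup step also wipes a data point's labels, any metric computed afterwards collapses to zero, while skipping that step leaves the score intact:

```python
class DataPoint:
    """Toy stand-in for a data point carrying an embedding and labels."""
    def __init__(self, gold):
        self.embedding = [0.1, 0.2, 0.3]
        self.labels = {"gold": gold}

    def clear_embeddings(self):
        self.embedding = []
        # Hypothesized bug: the label dict is wiped along with embeddings.
        self.labels = {}

def store_embeddings(points, mode):
    # Toy version of storage management: modes other than "none" trigger a
    # per-point cleanup, which under the buggy hypothesis destroys labels.
    if mode != "none":
        for p in points:
            p.clear_embeddings()

def micro_accuracy(points):
    correct = sum(
        1 for p in points
        if "gold" in p.labels and p.labels.get("predicted") == p.labels["gold"]
    )
    return correct / len(points) if points else 0.0

def evaluate(mode):
    points = [DataPoint("B-SHI"), DataPoint("O")]
    for p in points:
        p.labels["predicted"] = p.labels["gold"]  # simulate a perfect tagger
    store_embeddings(points, mode)                # storage management step
    return micro_accuracy(points)

print(evaluate("cpu"))   # 0.0: labels wiped before scoring
print(evaluate("none"))  # 1.0: cleanup skipped, score survives
```

The point of the sketch is only that a score of exactly zero despite a decreasing loss is consistent with labels disappearing between prediction and scoring, rather than with a genuinely bad model.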
Describe the bug
When training a SequenceTagger for NER with the last layer of RoBERTa embeddings, the micro-average F1 score on the validation set is consistently 0, even though the training loss decreases as expected. However, the test-set F1 score is 0.8490, so there appears to be an issue with the logging of validation F1 scores.
My dataset only has 3 possible tags: B-SHI, I-SHI and O.
To Reproduce
Expected behavior
Non-zero validation F1 scores while the training loss is decreasing; the validation F1 near the end of 150 epochs should be comparable to the test-set F1.
Logs and Stack traces