Closed: sorobedio closed this issue 2 months ago
Evaluating on the GLUE benchmark requires uploading your predictions to its website: https://gluebenchmark.com/
After training finishes, you will find the prediction files in your output directory. Use util/format_glue_output.py
to reformat them, then zip them and upload the archive to the website.
Note that for the STS-B task, if any predictions fall outside the range 0~5, you need to clip them manually to 0 or 5 to avoid errors from the website. Also keep in mind that the website only allows 3 successful evaluation submissions.
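For reference, here is a minimal sketch of the clipping step (the function name and example values are just illustrations; adapt it to however you load your STS-B prediction file):

```python
def clip_stsb(score: float, lo: float = 0.0, hi: float = 5.0) -> float:
    """Truncate an STS-B similarity score into the website's accepted range [0, 5]."""
    return max(lo, min(hi, score))

# Example raw model outputs, including values outside [0, 5]:
preds = [-0.2, 2.7, 5.3]
clipped = [clip_stsb(p) for p in preds]
print(clipped)  # [0.0, 2.7, 5.0]
```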
When an evaluation is done, you can get the detailed scores by clicking the hyperlink in the Model cell, like:
Aha, okay, thank you. I was considering evaluating locally.
Hello,
I ran the GLUE baseline but couldn't reproduce the performance reported in the paper, likely because some evaluation metrics are missing. The code uses accuracy and cross-entropy, whereas some tasks require metrics such as F1 score or the Matthews correlation coefficient. I am specifically referring to the DistilBERT baseline.
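In case it helps anyone evaluating locally in the meantime, here is a rough pure-Python sketch of the two metrics mentioned above for binary labels (in practice `sklearn.metrics.f1_score` and `sklearn.metrics.matthews_corrcoef` compute the same quantities; this is not the repo's evaluation code, just an illustration):

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) counts for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); used e.g. for MRPC and QQP."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); used for CoLA."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))            # 0.5
print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0]))   # 0.0
```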
Could you please include the evaluation code?
Thank you.