IshiKura-a / ModelGPT


Baseline evaluation metrics missing #2

Closed sorobedio closed 2 months ago

sorobedio commented 2 months ago

Hello,

I ran the GLUE baseline but couldn't reproduce the performance reported in the paper, likely because some evaluation metrics are missing. The code uses accuracy and cross-entropy, whereas some tasks require metrics such as F1 score or Matthews correlation. I am specifically referring to the baseline for DistilBERT.
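
For reference, the kind of local scoring I had in mind is roughly the following (a sketch using scikit-learn/scipy; the arrays are placeholder values, not real outputs):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from scipy.stats import pearsonr, spearmanr

labels = np.array([1, 0, 1, 1, 0])   # hypothetical gold labels
preds = np.array([1, 0, 0, 1, 0])    # hypothetical model predictions

print("Accuracy (e.g. SST-2, QNLI, RTE):", accuracy_score(labels, preds))
print("F1 (e.g. MRPC, QQP):", f1_score(labels, preds))
print("Matthews corr (CoLA):", matthews_corrcoef(labels, preds))

# STS-B is a regression task scored with Pearson/Spearman correlation.
gold_scores = np.array([4.2, 1.0, 3.5, 0.8])   # hypothetical similarity scores
pred_scores = np.array([3.9, 1.4, 3.1, 1.2])
print("Pearson (STS-B):", pearsonr(gold_scores, pred_scores)[0])
print("Spearman (STS-B):", spearmanr(gold_scores, pred_scores)[0])
```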

Could you please include the evaluation code?

Thank you.

IshiKura-a commented 2 months ago

Evaluation on the GLUE benchmark requires uploading the predictions to its website: https://gluebenchmark.com/

After training finishes, you will find the prediction files in your output dir. Use util/format_glue_output.py to reformat them, then zip them and upload the archive to the website.
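
A minimal sketch of the packaging step, assuming the reformatted prediction files are .tsv files sitting directly in the output dir (I haven't verified the exact arguments of util/format_glue_output.py, so only the zipping is shown):

```python
import zipfile
from pathlib import Path

output_dir = Path("output")  # assumed output directory

# Collect the reformatted per-task prediction files and zip them at the
# archive root, which (to my understanding) is the flat layout the GLUE
# submission page expects.
with zipfile.ZipFile("glue_submission.zip", "w") as zf:
    for tsv in sorted(output_dir.glob("*.tsv")):
        zf.write(tsv, arcname=tsv.name)
```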

Note that for the STS-B task, if any predictions fall outside the range 0~5, you need to clip them manually to 0 or 5 to avoid errors from the website. Also, the website only allows three successful evaluations.
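
For example, assuming the STS-B prediction file is a two-column TSV with index and prediction headers (an assumption about the exact output format), the clipping could look like this:

```python
import pandas as pd

# Assumed path and column name; adjust to the actual output of
# util/format_glue_output.py.
df = pd.read_csv("output/STS-B.tsv", sep="\t")
df["prediction"] = df["prediction"].clip(lower=0.0, upper=5.0)
df.to_csv("output/STS-B.tsv", sep="\t", index=False)
```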

Once an evaluation is done, you can get the detailed scores by clicking the hyperlink in the Model cell, e.g. GLUE.

sorobedio commented 2 months ago

Aha, okay, thank you. I was considering evaluating locally.