bigscience-workshop / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

different score ranges are confusing #119

Open Muennighoff opened 2 years ago

Muennighoff commented 2 years ago

It's confusing that BLEU scores are reported on a 0–100 scale while ROUGE scores are 0–1 in this repo. I think all scores should use either the 0–100 or the 0–1 range, probably the former.

jon-tow commented 2 years ago

I agree that standardizing on the $[0, 100]$ range is ideal for the readability of these scores. The difference here is that the underlying sacrebleu package scales BLEU/TER/chrF scores by $100$. These are the only metrics in the harness that are scaled (accuracy, ROUGE, SARI, etc. are not). So, to make everything consistent for now, we could re-scale BLEU back to its "natural" units in $[0, 1]$ and follow up with an optional per-metric "results-formatter". What do you think?
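
As a rough sketch of the re-scaling idea (not the harness's actual metric code, just an illustration assuming the standard sacrebleu API), the $0$–$100$ corpus BLEU could be divided by $100$ so it lands in the same $[0, 1]$ range as the unscaled metrics:

```python
# Minimal sketch: re-scale sacrebleu's 0-100 BLEU into [0, 1] so it is
# consistent with unscaled metrics such as ROUGE, SARI, and accuracy.
import sacrebleu


def bleu_unit_scale(predictions, references):
    """Return corpus BLEU in its 'natural' [0, 1] range."""
    # sacrebleu.corpus_bleu takes a list of hypotheses and a list of
    # reference streams; its .score attribute is reported in [0, 100].
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    return bleu.score / 100.0
```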

StellaAthena commented 2 years ago

I think this suggestion makes a lot of sense. Additionally, it would be nice to have the option to get rounded answers, e.g., 17.7%.
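
A possible shape for that option, as a hypothetical sketch (the function name and arguments are illustrative, not an existing harness API): keep scores in $[0, 1]$ internally and round only when formatting for display.

```python
# Hypothetical per-metric results formatter: stores scores in [0, 1]
# and optionally renders them as rounded percentages for readability.
def format_score(value, as_percent=True, digits=1):
    """Format a [0, 1] score, e.g. 0.177 -> '17.7%' (or '0.177')."""
    if as_percent:
        return f"{value * 100:.{digits}f}%"
    return f"{value:.{digits + 2}f}"


print(format_score(0.177))         # '17.7%'
print(format_score(0.177, False))  # '0.177'
```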