Open Muennighoff opened 2 years ago
I agree that standardizing on the $[0, 100]$ range is ideal for the readability of these scores. The difference here is that the underlying sacrebleu package scales BLEU/TER/chrF scores by $100$. These are the only metrics in the harness that are scaled (accuracy, ROUGE, SARI, etc. are not). So, to make everything consistent for now, we can re-scale BLEU back to its "natural" units in $[0, 1]$ and follow up with an optional per-metric "results-formatter". What do you think?
I think this suggestion makes a lot of sense. Additionally, it would be nice to have the option to get rounded answers, e.g., 17.7%.
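For illustration, here is a rough sketch of what the re-scaling plus an optional formatter could look like. The names `SCALED_BY_100`, `to_unit_scale`, and `format_result` are hypothetical and not part of the harness or sacrebleu; the set of scaled metrics is just the one listed in this thread.

```python
# Hypothetical helpers for illustration only; not the harness's actual API.

# Metrics that, per this thread, sacrebleu reports multiplied by 100.
SCALED_BY_100 = {"bleu", "ter", "chrf"}


def to_unit_scale(metric: str, score: float) -> float:
    """Divide scores of the x100-scaled metrics by 100; leave others unchanged."""
    if metric.lower() in SCALED_BY_100:
        return score / 100.0
    return score


def format_result(score: float, as_percent: bool = False, ndigits: int = 1) -> str:
    """Render a unit-scale score, optionally as a rounded percentage string."""
    if as_percent:
        return f"{score * 100:.{ndigits}f}%"
    # Keep two extra decimals so the raw value carries the same precision.
    return f"{score:.{ndigits + 2}f}"


if __name__ == "__main__":
    bleu = to_unit_scale("bleu", 17.7)           # sacrebleu-style 0-100 input
    print(format_result(bleu, as_percent=True))  # rounded percentage string
    print(format_result(to_unit_scale("rouge1", 0.42)))
```

Keeping the stored value in natural units and applying the percentage only at display time would mean the formatter stays optional and the results files stay consistent across metrics.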
It's confusing that BLEU scores are 0-100 and ROUGE scores are 0-1 in this repo; I think all scores should be either 0-100 or 0-1, probably the former.