bigscience-workshop / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

Rouge score #157

Open Muennighoff opened 1 year ago

Muennighoff commented 1 year ago

It says: "Multi-lingual ROUGE is unsupported as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended." But it also works for, e.g., Spanish and other languages that split on spaces like English, right?

cc @jon-tow

jon-tow commented 1 year ago

I probably would not recommend it for Spanish or any other "normal" space-delimited language in its current state. The default tokenizer used in rouge_scorer lowercases the text and replaces every character outside a-z and 0-9 with a space, so, for example, the text "Cristóbal está ayudando a su Abuela" would be converted to "crist bal est ayudando a su abuela"; the ó and á are dropped and "Cristóbal" is split into two tokens. See the tokenize definition here: https://github.com/google-research/google-research/blob/0aa035ff363066089612fb37e3e137a71cadb9c0/rouge/tokenize.py#L50-L61 Though, if you could loosen the non_alpha_numeric pattern so that it keeps accented letters etc., it should be fine.
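
To make the difference concrete, here is a minimal standalone sketch using only the standard library. It does not import rouge_score; `ASCII_NON_ALNUM`, `UNICODE_NON_ALNUM`, and `tokenize` are hypothetical names that only approximate the lowercase-and-replace step described above, and stemming is left out. The loosened pattern is one possible choice, not something the library provides.

```python
import re
import unicodedata

# Approximation of the default rule: anything outside a-z / 0-9 is a separator.
ASCII_NON_ALNUM = re.compile(r"[^a-z0-9]+")

# Hypothetical loosened rule: \W is Unicode-aware for str patterns in Python 3,
# so accented letters count as word characters; "_" is still a separator.
UNICODE_NON_ALNUM = re.compile(r"[\W_]+")


def tokenize(text, pattern):
    """Lowercase, replace separator runs with spaces, then split on whitespace."""
    # NFC composes "o" + combining accent into a single "ó" so \W does not
    # split on a stray combining mark.
    text = unicodedata.normalize("NFC", text).lower()
    return pattern.sub(" ", text).split()


sentence = "Cristóbal está ayudando a su Abuela"
ascii_tokens = tokenize(sentence, ASCII_NON_ALNUM)
unicode_tokens = tokenize(sentence, UNICODE_NON_ALNUM)

print(ascii_tokens)    # accented letters act as separators
print(unicode_tokens)  # accented letters are kept inside tokens
```

If you patch the real tokenizer along these lines, also check whether it filters tokens again after splitting, since an ASCII-only filter there would still drop the accented tokens.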