Open Muennighoff opened 1 year ago
I probably would not recommend it for Spanish or any other "normal" space-delimited language in its current state. The default tokenizer used in rouge_scorer lowercases the text and replaces every character outside ASCII a-z and 0-9 with a space, so, for example, the text "Cristóbal está ayudando a su Abuela" would be converted to "crist bal est ayudando a su abuela", dropping the `ó` and `á`. See the `tokenize` definition here:
https://github.com/google-research/google-research/blob/0aa035ff363066089612fb37e3e137a71cadb9c0/rouge/tokenize.py#L50-L61
Though if you loosen the non_alpha_numeric pattern so that it no longer strips accented letters and similar characters, it should be fine.
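
To make the difference concrete, here is a minimal sketch that does not import rouge-score at all. It assumes the tokenizer's pattern is equivalent to `[^a-z0-9]+` applied after lowercasing, as described above. `simple_tokenize`, `ASCII_ONLY` and `UNICODE_AWARE` are hypothetical names used only for illustration, and the loosened pattern is one possible choice rather than a fix taken from the library.

```python
import re
import unicodedata

# Assumption: mirrors the ASCII-only pattern described in the comment above.
ASCII_ONLY = re.compile(r"[^a-z0-9]+")

# Hypothetical loosened pattern. For str patterns in Python 3, \W is
# Unicode-aware, so accented letters count as word characters and are kept.
# The underscore is listed explicitly because \w would otherwise keep it.
UNICODE_AWARE = re.compile(r"[\W_]+")


def simple_tokenize(text, non_alnum):
    """Lowercase, replace runs matched by `non_alnum` with a space, then split."""
    # NFC composes "o" + combining accent into a single letter such as "ó",
    # so decomposed input is not split at the combining mark.
    text = unicodedata.normalize("NFC", text).lower()
    return non_alnum.sub(" ", text).split()


sentence = "Cristóbal está ayudando a su Abuela"
ascii_tokens = simple_tokenize(sentence, ASCII_ONLY)
unicode_tokens = simple_tokenize(sentence, UNICODE_AWARE)

print(ascii_tokens)    # accented letters are lost and "cristóbal" is split
print(unicode_tokens)  # accented letters survive
```

With the ASCII-only pattern the sentence yields seven tokens, because "cristóbal" is split into "crist" and "bal"; with the loosened pattern it yields the expected six. Stemming is left out of the sketch, and it would still be English-specific even with a looser pattern.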
It says

> Multi-lingual ROUGE is unsupported as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended.

but it also works for e.g. Spanish and other languages that split on spaces like English, right? cc @jon-tow