EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

WMT JP/ZH/KO Tokenization #212

Closed Muennighoff closed 3 years ago

Muennighoff commented 3 years ago

We don't tokenize JP, ZH & co. prior to computing the BLEU score, which means it's computed on the entire sentence as a single token, since JP & ZH don't use spaces.

I can fix this when I have some time, just noting it here for now~
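A minimal sketch of the problem, assuming sacrebleu and nagisa and made-up example sentences (not harness code): without pre-segmentation the default tokenizer leaves a Japanese sentence as one long token, so even near-identical hypothesis/reference pairs share almost no n-grams.

```python
# Sketch only. Requires: pip install sacrebleu nagisa
import sacrebleu
import nagisa

ref = "私はシンガポールが好きです。"
hyp = "私はシンガポールが大好きです。"  # differs by a single word

# Unsegmented: the whole sentence is effectively one token, so BLEU collapses.
raw = sacrebleu.corpus_bleu([hyp], [[ref]])

# Pre-segmented (here via nagisa): BLEU can see the shared words.
seg_hyp = " ".join(nagisa.tagging(hyp).words)
seg_ref = " ".join(nagisa.tagging(ref).words)
seg = sacrebleu.corpus_bleu([seg_hyp], [[seg_ref]])

print(f"unsegmented BLEU: {raw.score:.1f}")
print(f"segmented BLEU:   {seg.score:.1f}")
```

(Newer sacrebleu versions also expose built-in `zh` and `ja-mecab` tokenizer options, which may be another way to get the same effect.)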

Muennighoff commented 3 years ago

Essentially we'd need to add the packages for Chinese, Japanese, Korean and Thai listed on this page: https://spacy.io/usage/models

and then check for the language and tokenize accordingly, roughly as in the sketch below.
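If we went the spaCy route, a rough sketch could look like this; the helper name and the per-language package notes are assumptions (spaCy's blank pipelines pull in extra dependencies per language), not something the harness currently ships:

```python
# Hypothetical helper: pick a segmenter based on the target language code
# and return the text re-joined with spaces before BLEU is computed.
# spaCy blank pipelines need extra packages per language, e.g.
#   ja -> sudachipy + sudachidict_core, th -> pythainlp, zh -> (optionally) jieba
import spacy

_SEGMENTED_LANGS = {"zh", "ja", "ko", "th"}
_nlp_cache = {}

def segment_for_bleu(text: str, lang: str) -> str:
    if lang not in _SEGMENTED_LANGS:
        return text  # languages that already use spaces are left untouched
    if lang not in _nlp_cache:
        _nlp_cache[lang] = spacy.blank(lang)
    return " ".join(tok.text for tok in _nlp_cache[lang](text))
```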

Muennighoff commented 3 years ago

I'm hesitant to add additional packages just for those languages to do the tokenization. (To calculate the BLEU score correctly, we need 私はシンガポールが好きです。 ---> 私 は シンガポール が 好き です 。)

I was wondering whether it wouldn't be possible to just train a tiny GPT to do the space tokenization; we could then download it from Hugging Face and use it before calculating the BLEU score. Do you think that would make sense @leogao2? I'd love to try it out.

leogao2 commented 3 years ago

That sounds significantly more complicated than just using existing segmentation packages. I think adding more dependencies for segmentation instead is fine.

Muennighoff commented 3 years ago

Even if they're GPL licensed?

leogao2 commented 3 years ago

Can't we use jieba/nagisa for zh/jp? Both of those are MIT-licensed.
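Both are a one-liner to wire in; a hedged sketch of what that pre-tokenization step could look like (not the exact code that later landed):

```python
# Sketch only: space-joining segmenters for Chinese and Japanese, both MIT-licensed.
# Requires: pip install jieba nagisa
import jieba
import nagisa

def zh_segment(text: str) -> str:
    # jieba.cut returns a generator of word pieces
    return " ".join(jieba.cut(text))

def ja_segment(text: str) -> str:
    # nagisa returns a tagged object whose .words attribute is the token list
    return " ".join(nagisa.tagging(text).words)

print(zh_segment("我喜欢新加坡。"))              # e.g. "我 喜欢 新加坡 。"
print(ja_segment("私はシンガポールが好きです。"))  # "私 は シンガポール が 好き です 。"
```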

Muennighoff commented 3 years ago

Yep, but for Korean I was going to use https://github.com/LuminosoInsight/mecab-ko-dic, which requires MeCab, which is GPL.

KoNLPy is also GPL (https://konlpy.org/en/latest/#license)
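For reference, the GPL-licensed route would look roughly like this (a sketch assuming KoNLPy's Okt tagger; not what was merged):

```python
# Sketch of the GPL-licensed Korean option that was not adopted.
# Requires: pip install konlpy (plus a JVM for the Okt tagger)
from konlpy.tag import Okt

okt = Okt()
text = "그래서 이것은 우리가 알고있는 것을 어떻게 알고 있는지에 대한 이야기입니다."
print(" ".join(okt.morphs(text)))  # morpheme-level segmentation before BLEU
```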

leogao2 commented 3 years ago

Does ko need segmentation? I thought it uses spaces

Muennighoff commented 3 years ago

Hm, my knowledge of Korean is unfortunately quite limited, but in this repo, running python main.py --model m2m --weights facebook/m2m100_418M --data en-ko --sample 500 I get BLEU 36.651343611373036 with tokenization and BLEU 2.9324395566566337 without (removing it in the data folder).

E.g. a sentence like 그래서 이것은 우리가 알고있는 것을 어떻게 알고 있는지에 대한 이야기입니다. is turned into 그래서 이것 은 우리 가 알 고 있 는 것 을 어떻게 알 고 있 는지 에 대한 이야기 입니다 .

I think we can ignore it for now, though - even if the BLEU ends up a bit lower for Korean, relative comparisons should still make sense as long as the BLEU is calculated the same way.

I'll open a PR for Japanese & Chinese - other languages like Thai we can do later on ad hoc, I think.

Muennighoff commented 3 years ago

Merged in #214. Note: consider tokenization for Thai in the future when it's available in WMT, and possibly Korean if it's raised as an issue.