Closed Muennighoff closed 3 years ago
Essentially we'd need to add the packages for Chinese, Japanese, Korean, and Thai listed somewhere on this page: https://spacy.io/usage/models
and then check for the lang and tokenize accordingly
I'm hesitant to add additional packages just for those langs to do the tokenization (To calculate BLEU score correctly, we need 私はシンガポールが好きです。 ---> 私 は シンガポール が 好き です 。)
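If the extra segmentation packages end up being a blocker, one dependency-free fallback (a rough sketch, not anything this repo actually does) is character-level splitting of CJK text. It's cruder than the word-level segmentation jieba (zh) or nagisa (ja) would give, but it at least makes n-gram overlap meaningful for languages written without spaces:

```python
import re

# Put spaces around every CJK character (kana, CJK Extension A, unified
# ideographs, compatibility ideographs). This is character-level splitting,
# not real word segmentation, so treat it only as a crude fallback.
CJK = re.compile(r"([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff])")

def cjk_char_tokenize(text: str) -> list[str]:
    return CJK.sub(r" \1 ", text).split()

print(cjk_char_tokenize("私はシンガポールが好きです。"))
# ['私', 'は', 'シ', 'ン', 'ガ', 'ポ', 'ー', 'ル', 'が', '好', 'き', 'で', 'す', '。']
```

Note this splits シンガポール into individual kana rather than keeping it as one word, which is exactly why proper segmenters are preferable when licensing allows.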
I was thinking it might be possible to just train a tiny GPT to do the space tokenization; we could then just download it from Hugging Face and use it before calculating the BLEU score. Do you think that would make sense @leogao2? I'd love to try it out
That sounds significantly more complicated than just using existing segmentation packages. I think adding more dependencies for segmentation instead is fine.
Even if they're GPL licensed?
Can't we use jieba/nagisa for zh/jp? Both of those are MIT
Yep, but for Korean I was gonna use https://github.com/LuminosoInsight/mecab-ko-dic, which requires MeCab, and MeCab is GPL
KoNLPy is also GPL (https://konlpy.org/en/latest/#license)
Does ko need segmentation? I thought it uses spaces
Hm, my knowledge of Korean is unfortunately quite limited, but in this repo, running
python main.py --model m2m --weights facebook/m2m100_418M --data en-ko --sample 500
I get BLEU 36.651343611373036 with tokenization and BLEU 2.9324395566566337 without (removing it in the data folder).
E.g. a sentence like 그래서 이것은 우리가 알고있는 것을 어떻게 알고 있는지에 대한 이야기입니다.
is turned into
그래서 이것 은 우리 가 알 고 있 는 것 을 어떻게 알 고 있 는지 에 대한 이야기 입니다 .
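To make the gap concrete, here's a tiny sketch of why treating each space-delimited Korean chunk as one token kills matches. It uses clipped unigram precision (one component of BLEU, not the repo's actual scoring code), and the 입니다/예요 pair plus the morpheme splits are my own illustrative example, not real segmenter output:

```python
from collections import Counter

def clipped_unigram_precision(hyp_toks, ref_toks):
    # Count each hypothesis token as matched at most as many times
    # as it appears in the reference (BLEU-style clipping).
    ref_counts = Counter(ref_toks)
    matched = 0
    for t in hyp_toks:
        if ref_counts[t] > 0:
            ref_counts[t] -= 1
            matched += 1
    return matched / len(hyp_toks)

# Same content, different politeness ending: 입니다 (formal) vs 예요 (polite).
# As whole space-delimited chunks, the tokens don't match at all:
print(clipped_unigram_precision(["이야기예요."], ["이야기입니다."]))        # 0.0
# After (illustrative) morpheme segmentation, the shared stem still matches:
print(clipped_unigram_precision(["이야기", "예요", "."],
                                ["이야기", "입니다", "."]))               # ~0.667
```

So a single suffix difference zeroes out the whole chunk without segmentation, which is consistent with the large BLEU gap above.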
I think we can ignore it for now, though; even if the BLEU is a bit lower for Korean, relative comparisons should still make sense as long as the BLEU is calculated the same way.
I'll open a PR for Japanese & Chinese; other languages like Thai we can do later on, ad hoc, I think.
Merged in #214. Note: consider tokenization for Thai in the future when available in WMT, and possibly Korean if it's raised as an issue
We don't tokenize JP, ZH & co. prior to computing the BLEU score, which means it's computed on the entire untokenized sentence, as JP & ZH don't use spaces;
I can fix this when I have some time, just noting it here for now~