fnielsen / ordia

Wikidata lexemes presentations
https://ordia.toolforge.org
Apache License 2.0

Improve tokenization for CJK languages #95

Open Daniel-Mietchen opened 4 years ago

Daniel-Mietchen commented 4 years ago

Japanese example attached. The sentence 下記方法で体内への侵入を防止すること (roughly: "prevent it from entering the body by the method described below") from here should be tokenized somewhat like the following, where a single pipe character stands for a word boundary and two pipes for lexeme boundaries more generally:

下記方法|で||体内||へ||の||侵入||を||防止|する||こと

Similar issues exist with Korean and Chinese texts, so I am keeping them together in this issue for now.

[Screenshot attached: Screenshot_2020-04-08, tools.wmflabs.org]

Artoria2e5 commented 3 years ago

I think Korean might be able to get away with simple whitespace splitting, given how modern texts are written with word spaces. I guess that would be different for more casual text?
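
A minimal sketch of that whitespace approach, assuming formal Korean text with standard spacing; the example sentence is purely illustrative, and this only yields space-delimited chunks rather than individual lexemes:

```python
# Minimal sketch: split formal Korean text on whitespace.
# The sentence is only an illustration, not taken from Ordia.
text = "나는 학교에 간다"   # "I go to school"
tokens = text.split()        # ['나는', '학교에', '간다']
print("|".join(tokens))
```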

For Chinese tokenization I usually recommend jieba. It's... good. I can't even think of another tokenizer off the top of my head. And the dictionaries are not big -- the HMM magic takes care of unknown words.
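
For reference, a minimal sketch of jieba's default cut mode; the input sentence is just an illustration and not tied to Ordia:

```python
# Minimal sketch of Chinese word segmentation with jieba (pip install jieba).
# HMM=True lets jieba guess words that are not in its dictionary.
import jieba

text = "我来到北京清华大学"
tokens = jieba.lcut(text, HMM=True)  # e.g. ['我', '来到', '北京', '清华大学']
print("|".join(tokens))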

I think there are also Japanese tokenizers in Python (truth be told, everything data-splitty-chunky is written in Python these days), but as I don't speak it I have no idea what to use. The first Google result is something called fugashi. The author looks very serious, but the dictionary is big.
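
A hedged sketch of what fugashi usage looks like, assuming the unidic-lite dictionary package is installed; I have not verified this against Ordia's pipeline, and the output is MeCab-style morphemes, which is finer-grained than the lexeme boundaries asked for above:

```python
# Sketch of Japanese tokenization with fugashi
# (pip install "fugashi[unidic-lite]").
from fugashi import Tagger

tagger = Tagger()
text = "下記方法で体内への侵入を防止すること"
tokens = [word.surface for word in tagger(text)]
print("|".join(tokens))
```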

fnielsen commented 3 years ago

I am unsure how to do CJK tokenization. As I understand it, there is no easy way such as splitting on a delimiter character, so one would need to use a dictionary-based tool like jieba.
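
To make that concrete, here is a hypothetical sketch of a language-dispatching tokenizer. The `tokenize` helper and its language codes are my own invention, not part of Ordia, and the jieba/fugashi branches assume those packages are installed:

```python
# Hypothetical helper, not part of Ordia: dispatch tokenization by language code.
# jieba (Chinese) and fugashi (Japanese) are optional dependencies here.
def tokenize(text, language):
    if language == "zh":
        import jieba
        return jieba.lcut(text)
    if language == "ja":
        from fugashi import Tagger
        return [word.surface for word in Tagger()(text)]
    # Korean and other space-delimited languages: fall back to whitespace splitting.
    return text.split()
```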