Word segmenter for Far east Asia languages

aethanyc / icu4x

Solving i18n for client-side and resource-constrained environments.

Word segmenter for Far east Asia languages #4

Closed makotokato closed 2 years ago

makotokato commented 3 years ago

The LSTM segmenter appears to be aimed at Southeast Asian languages such as Thai, but I am not sure whether it works for Japanese and Chinese. Since UAX #29 doesn't define word segmentation rules for Chinese and Japanese, ICU uses a dictionary-based segmenter for them.

Can the LSTM segmenter be used for CJ (Chinese and Japanese)?
Are there other segmenter implementations that use machine learning for CJ?
Should we use a dictionary model for CJ, as ICU does? A dictionary-based implementation cannot cover all words.
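
To illustrate the coverage concern, here is a minimal sketch of a dictionary-based segmenter using greedy forward longest-match. It is a simplified illustration, not the algorithm ICU or ICU4X actually uses; the function name `segment`, the `max_word_chars` parameter, and the four-entry toy dictionary are all hypothetical.

```rust
use std::collections::HashSet;

/// Greedy forward longest-match segmentation over a toy dictionary.
/// Hypothetical helper for illustration only; not ICU's or ICU4X's API.
/// Any character that starts no dictionary word is emitted on its own.
fn segment(text: &str, dict: &HashSet<&str>, max_word_chars: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut words = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let longest = max_word_chars.min(chars.len() - start);
        // Fallback when nothing matches: a single character.
        let mut len = 1;
        // Try the longest candidate first and shrink until the dictionary hits.
        for cand in (2..=longest).rev() {
            let candidate: String = chars[start..start + cand].iter().collect();
            if dict.contains(candidate.as_str()) {
                len = cand;
                break;
            }
        }
        words.push(chars[start..start + len].iter().collect::<String>());
        start += len;
    }
    words
}

fn main() {
    // Toy dictionary (an assumption for this example).
    let dict: HashSet<&str> = ["東京", "東京都", "に", "住む"].iter().copied().collect();

    // Every word is in the dictionary.
    println!("{:?}", segment("東京都に住む", &dict, 3)); // ["東京都", "に", "住む"]

    // "渋谷" is not in the dictionary, so it falls apart into single characters.
    println!("{:?}", segment("渋谷に住む", &dict, 3)); // ["渋", "谷", "に", "住む"]
}
```

The second call shows the limitation raised above: a word missing from the dictionary (here the place name 渋谷) cannot be recovered as a unit, no matter how the matching is tuned.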

aethanyc commented 3 years ago

@makotokato Another question: are there any customization rules or optimizations for Chinese and Japanese that are not in the spec but are in Gecko's current lwbrk, and that we want to support?

makotokato commented 3 years ago

> Are there any customization rules or optimizations for Chinese and Japanese that are not in the spec

Sentence segmentation. Sentence boundaries in Japanese depend on context, so they cannot be defined by fixed rules.

> but are in Gecko's current lwbrk, and that we want to support?

lwbrk requires only word and line segmentation. For word segmentation, we need some way to implement it (dictionary or ML?). Unfortunately, a dictionary won't cover every word segmentation case in Japanese.

aethanyc commented 2 years ago

We've tracked this issue in https://github.com/unicode-org/icu4x/issues/1033.