PrinOrange opened 7 months ago
A Japanese word can be written in two forms, kanji and kana. When indexing text by words, what the two forms have in common is that both can be represented in Roman letters, which can serve as a unique identifier. My idea is to first segment the Japanese text into words, then convert each word, whether kana or kanji, into its romanization, and finally build the index on the romanized forms. This solves the problem of matching the two written forms of the same Japanese word.
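A minimal sketch of that pipeline, assuming the kuromoji package for segmentation (its tokens carry a katakana `reading` field) and wanakana's `toRomaji` for the kana-to-romaji step; the dictionary path and the fallback for out-of-dictionary words are my assumptions, so treat this as an illustration rather than a finished design:

```ts
// Sketch only: assumes `npm i kuromoji wanakana` plus @types/kuromoji.
import kuromoji from "kuromoji";
import { toRomaji } from "wanakana";

// Build index keys for a Japanese sentence: segment into words, then
// romanize each word's reading so kanji and kana spellings collide.
function romanizedKeys(text: string): Promise<string[]> {
  return new Promise((resolve, reject) => {
    kuromoji
      .builder({ dicPath: "node_modules/kuromoji/dict" }) // assumed dict location
      .build((err, tokenizer) => {
        if (err) return reject(err);
        const keys = tokenizer.tokenize(text).map((token) => {
          // `reading` is the katakana reading; fall back to the surface
          // form for words the dictionary does not know (an assumption).
          const reading = token.reading ?? token.surface_form;
          return toRomaji(reading);
        });
        resolve(keys);
      });
  });
}

// Both the kanji spelling 漢字 and the kana spelling かんじ should
// yield the same key, "kanji", which is the matching property above.
romanizedKeys("漢字もかんじも同じ語です").then((keys) => console.log(keys));
```

Because the keys come out as plain ASCII, the existing Latin-text indexing path could store and query them without special handling.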
This project already supports Chinese, Japanese, and Korean typography, but its full-text search ability is still deficient: it cannot process Japanese or Korean sentences when indexing the search database.
I have researched some aspects of Japanese processing, and its features are completely different from Latin-script languages and even from Chinese: for example, Japanese writes no spaces between words, and the same word may appear in kanji or in kana, so segmentation cannot rely on whitespace and usually requires morphological analysis.
I have checked many tokenizers in the open-source community, but there seems to be very little work on this in the Japanese community, and the available information is sparse.
So far I have only found tiny-segmenter.js and kuromoji.js, and both have been poorly maintained for years.
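For what it's worth, tiny-segmenter.js also shows why segmentation alone is not enough for the scheme above: it returns surface forms only, with no readings attached. A minimal sketch, assuming the npm package `tiny-segmenter` (which ships no type definitions, hence the plain `require` and assumed constructor shape):

```ts
// Sketch only: tiny-segmenter has no bundled types, so the API shape
// here is assumed from its documented JavaScript usage.
const TinySegmenter = require("tiny-segmenter");

const segmenter = new TinySegmenter();
// segment() returns an array of surface strings, e.g.
// ["私", "の", "名前", "は", "中野", "です"] — but no readings,
// so kanji words cannot be romanized from this output alone.
const words: string[] = segmenter.segment("私の名前は中野です");
console.log(words);
```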
I'm still seeking other ideas now...