PrinOrange opened 7 months ago
A Japanese word can be written in two forms, kanji and kana. When indexing text by words, what the two forms have in common is that both can be represented in Roman letters, which can serve as a unique identifier. My idea is to first segment the Japanese text into words, then convert each word, whether kana or kanji, into its romanization, and finally build the index on the romanized forms. This solves the problem of matching the two written forms of the same Japanese word.
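A minimal sketch of that pipeline, assuming the kuromoji package for segmentation (its tokens carry a katakana `reading` field) and wanakana's `toRomaji` for the kana-to-romaji step; the dictionary path and the fallback for out-of-dictionary words are my assumptions, so treat this as an illustration rather than a finished design:

```ts
// Sketch only: assumes `npm i kuromoji wanakana` plus @types/kuromoji.
import kuromoji from "kuromoji";
import { toRomaji } from "wanakana";

// Build index keys for a Japanese sentence: segment into words, then
// romanize each word's reading so kanji and kana spellings collide.
function romanizedKeys(text: string): Promise<string[]> {
  return new Promise((resolve, reject) => {
    kuromoji
      .builder({ dicPath: "node_modules/kuromoji/dict" }) // assumed dict location
      .build((err, tokenizer) => {
        if (err) return reject(err);
        const keys = tokenizer.tokenize(text).map((token) => {
          // `reading` is the katakana reading; fall back to the surface
          // form for words the dictionary does not know (an assumption).
          const reading = token.reading ?? token.surface_form;
          return toRomaji(reading);
        });
        resolve(keys);
      });
  });
}

// Both the kanji spelling 漢字 and the kana spelling かんじ should
// yield the same key, "kanji", which is the matching property above.
romanizedKeys("漢字もかんじも同じ語です").then((keys) => console.log(keys));
```

Because the keys come out as plain ASCII, the existing Latin-text indexing path could store and query them without special handling.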
This project already supports Chinese, Japanese, and Korean typography, but its full-text search ability is still deficient: it cannot process Japanese or Korean sentences when indexing the search database.
I have researched some aspects of Japanese processing, and its features are completely different from Latin-script languages and even from Chinese: for example, Japanese writes no spaces between words, and the same word may appear in kanji or in kana, so segmentation cannot rely on whitespace and usually requires morphological analysis.
I have checked many tokenizers in the open-source community, but there seems to be very little work on this in the Japanese community, and the available information is sparse.
So far I have only found tiny-segmenter.js and kuromoji.js, and both have been poorly maintained for years.
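For what it's worth, tiny-segmenter.js also shows why segmentation alone is not enough for the scheme above: it returns surface forms only, with no readings attached. A minimal sketch, assuming the npm package `tiny-segmenter` (which ships no type definitions, hence the plain `require` and assumed constructor shape):

```ts
// Sketch only: tiny-segmenter has no bundled types, so the API shape
// here is assumed from its documented JavaScript usage.
const TinySegmenter = require("tiny-segmenter");

const segmenter = new TinySegmenter();
// segment() returns an array of surface strings, e.g.
// ["私", "の", "名前", "は", "中野", "です"] — but no readings,
// so kanji words cannot be romanized from this output alone.
const words: string[] = segmenter.segment("私の名前は中野です");
console.log(words);
```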
I'm still seeking other ideas now...