dgraph-io / dgraph

The high-performance database for modern applications
https://dgraph.io

Improve CJK tokenizer support #2801

Closed. srfrog closed this issue 4 years ago.

srfrog commented 5 years ago

The current CJK tokenizer in v1.0.10 is the one included in Bleve. It has limited support and can yield extra tokens that aren't needed. We should use a package/library specifically designed for CJK segmentation that handles these languages better.

For example, the term "名字" ("first name") is tokenized as "名" and "字". But taken separately, "名" means "name" and "字" means "word/character", so the compound meaning "first name" is lost. A fulltext/term lookup for "名字" therefore won't return the expected results; the tokenizer should emit "名字" as a single term.
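To make the behavior concrete, here is a minimal, self-contained Go sketch of per-rune (unigram) splitting over Han characters, which mirrors the effect described above; it is an illustration, not Bleve's actual code:

```go
package main

import (
	"fmt"
	"unicode"
)

// unigramTokens splits CJK text one rune per token, which is
// effectively what happens to "名字" today.
func unigramTokens(s string) []string {
	var tokens []string
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			tokens = append(tokens, string(r))
		}
	}
	return tokens
}

func main() {
	// Prints [名 字] -- the compound "first name" is lost.
	fmt.Println(unigramTokens("名字"))
}
```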

Some CJK packages considered are:

Refers #1421

ls84 commented 4 years ago

Yes, this issue causes unexpected search results. It turns anyofterms(text, "名字") into anyofterms(text, "名 字"), but these two characters "名" and "字" together should be treated as one word.

This makes search very unpredictable for Chinese. For now you can only use allofterms(), but it still treats "名字" as two separate words, which returns many extra search results. A dictionary-aware tokenizer would avoid this, as in the sketch below.
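For comparison, a dictionary-based tokenizer keeps the compound intact. Below is a hedged Go sketch of greedy longest-match segmentation against a tiny toy dictionary; a real fix would use a proper CJK segmentation library with a full dictionary, as the issue proposes:

```go
package main

import "fmt"

// dict is a toy dictionary for illustration only; a real
// segmenter ships a large one.
var dict = map[string]bool{"名字": true, "名": true, "字": true}

// segment greedily matches the longest dictionary entry at each
// position, falling back to a single rune when nothing matches.
func segment(s string) []string {
	runes := []rune(s)
	var tokens []string
	for i := 0; i < len(runes); {
		match := 1
		for j := len(runes); j > i; j-- {
			if dict[string(runes[i:j])] {
				match = j - i
				break
			}
		}
		tokens = append(tokens, string(runes[i:i+match]))
		i += match
	}
	return tokens
}

func main() {
	// Prints [名字] -- one token, so a term lookup for "名字"
	// matches as expected.
	fmt.Println(segment("名字"))
}
```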

suexcxine commented 1 year ago

Any progress on this?