djcb / mu

maildir indexer/searcher + emacs mail client + guile bindings
http://www.djcbsoftware.nl/code/mu
GNU General Public License v3.0

cannot search japanese text longer than 2 chars #1428

Closed fabiodl closed 1 year ago

fabiodl commented 5 years ago

When export XAPIAN_CJK_NGRAM=1 is used before indexing and searching (I also tried yes in place of 1, with no observable change), mu find is unable to search for Japanese text longer than 2 chars (mu: no matches for search expression (4)).

For instance, if "このデータを検索できるのかな" is present, "デー" can be found, but "データ" cannot.

Behavior confirmed for the following setup: Ubuntu 18.04.1, mu version 1.3.2, Xapian version 1.4.11.

A quick search online shows that the same happens on a completely different architecture, and probably on different versions: http://gcg00467.xii.jp/wp/archives/1749

On exactly the same maildir, in the same shell (with the same XAPIAN_CJK_NGRAM=1), notmuch correctly indexes and retrieves mails when the query is longer than 2 chars.

panjie commented 5 years ago

It is a known issue with Xapian. I made a workaround that breaks the CJK strings in queries into bi-grams, in my mu4e add-on project mu4e-goodies.

fabiodl commented 5 years ago

thank you panjie, I will have a look at it

djcb commented 3 years ago

Can you provide an email message where this happens (and specifically what to search for, since unfortunately I do not read Japanese)? We could add it as a unit test.

ychubachi commented 3 years ago

Hi djcb! This is my case. I set XAPIAN_CJK_NGRAM=1; let us assume that I have four mails whose subjects are as follows.

  1. サーバがダウンしました
  2. スポンサーシップ募集
  3. サービス開始について
  4. ショルダーバック

When I want to find 'サーバ', which means 'server' in Japanese, the correct answer should be 1. only.

Now I try:

  a) mu find subject:サーバ -> no matches
  b) mu find subject:サー -> matches 1., 2. and 3.
  c) mu find subject:サ -> no matches
  d) mu find subject:ーバ -> matches 1. and 4.
  e) mu find subject:サー and subject:ーバ -> matches 1. <- BINGO!
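The pattern in a) to e) is what one would expect if only overlapping 2-character terms were stored and each query term had to match a stored term exactly. The following Python sketch is a toy model built on that assumption; it is not how mu or Xapian actually index mail, and the names SUBJECTS, INDEX, bigrams and find are made up for illustration.

```python
# Toy model only: assumes the index holds overlapping 2-character terms
# and that every query term must equal one stored term exactly.
SUBJECTS = {
    1: "サーバがダウンしました",
    2: "スポンサーシップ募集",
    3: "サービス開始について",
    4: "ショルダーバック",
}


def bigrams(text):
    """Return the set of overlapping 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Hypothetical in-memory "index": mail number -> set of 2-grams.
INDEX = {num: bigrams(subject) for num, subject in SUBJECTS.items()}


def find(*terms):
    """Return the mail numbers whose 2-gram set contains every term."""
    return sorted(
        num for num, grams in INDEX.items()
        if all(term in grams for term in terms)
    )


for query in (("サーバ",), ("サー",), ("サ",), ("ーバ",), ("サー", "ーバ")):
    print(" and ".join(query), "->", find(*query))
```

In this model the 3-character term and the 1-character term never match because neither is a stored 2-gram, while the two 2-grams combined narrow the result to mail 1.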

So, if I want to find a Japanese word of 3 or more characters, I must divide the word into several 2-grams, then join them with 'and' operators, as follows.

  mu find subject:あいうえお -> NG
  mu find subject:あい and subject:いう and subject:うえ and subject:えお -> OK... but...
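Building that query by hand gets tedious, so the splitting could be scripted. Below is a minimal Python sketch of the idea; bigram_query is a hypothetical helper, not part of mu or mu4e-goodies, and it only builds the query string in the form shown above.

```python
def bigram_query(field, word):
    """Build a mu query that ANDs every overlapping 2-gram of word.

    Hypothetical helper: words of 2 characters or fewer are passed
    through unchanged.
    """
    if len(word) <= 2:
        return f"{field}:{word}"
    grams = [word[i:i + 2] for i in range(len(word) - 1)]
    return " and ".join(f"{field}:{gram}" for gram in grams)


print("mu find " + bigram_query("subject", "あいうえお"))
```

Note that such a query matches any subject containing all of the 2-grams, even if they do not appear next to each other, so it can return more mails than an exact phrase search would.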

The fundamental solution might be for Xapian to adopt a Japanese morphological analysis tool such as:

MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/

Welcome to janome’s documentation! (English) — Janome v0.4 documentation (en) https://mocobeta.github.io/janome/en/

But if we had some support for dividing a Japanese word into 2-grams and connecting them with 'and' operators, we would be much happier!

fabiodl commented 3 years ago

What does notmuch use? Notmuch is able to deal with cjk

ychubachi commented 3 years ago

Hi fabiodl! Unfortunately, I have never used notmuch yet. I will check it later.

djcb commented 3 years ago

I sent myself an email and at least for me it works:

$ mu find サーバがダウンしました 
2021-11-12T16:42:37 EET Yoshihide Chubachi <notifications@github.com> Re: [djcb/mu] cannot search japanese text longer than 2 chars (#1428)
2021-11-12T19:29:45 EET "Dirk-Jan C. Binnema" <djcb@djcbsoftware.nl> サーバがダウンしました

This is with Xapian 1.4.18 (and I'm not even setting XAPIAN_CJK_NGRAM)

djcb commented 3 years ago

I'm using:

$ echo $LANG
en_DK.utf8

@ychubachi : are you using a UTF-8 encoding?

ychubachi commented 3 years ago

Hi djcb!

I use UTF-8. But when you do not set XAPIAN_CJK_NGRAM, the situation becomes different.

Because Xapian does not know how to tokenize a Japanese sentence, it indexes the whole sentence, or something split at obvious delimiters such as '、' or '。'.

The correct tokenization would be something like サーバ/が/ダウン/しまし/た. Please try searching for only the word 'サーバ' or 'ダウン' in that case ('ダウン' means 'down').

ychubachi commented 3 years ago

I used notmuch and found that Japanese search worked fine when XAPIAN_CJK_NGRAM=1.

I tested the effect of XAPIAN_CJK_NGRAM variable.

$ XAPIAN_CJK_NGRAM= notmuch search subject:サーバ | wc -l
0
$ XAPIAN_CJK_NGRAM=1 notmuch search subject:サーバ | wc -l
537

On the other hand, mu does not seem to be affected by the variable.

$ XAPIAN_CJK_NGRAM= mu find subject:サーバ
error: no matches for search expression
$ XAPIAN_CJK_NGRAM=1 mu find subject:サーバ
error: no matches for search expression

I also tried the simplesearch.rb script at https://xapian.org/docs/bindings/ruby

$ XAPIAN_CJK_NGRAM= ruby simplesearch.rb ~/Maildir/.notmuch/xapian/ サーバ
Parsed query is: Query(サーバ@1)
0 results found.
Matches 1-0:
0
$ XAPIAN_CJK_NGRAM=1 ruby simplesearch.rb ~/Maildir/.notmuch/xapian/ サーバ
Parsed query is: Query((サ@1 AND サー@1 AND ー@1 AND ーバ@1 AND バ@1))
200 results found.
Matches 1-10:
10
1: 100% docid=122280 []
2: 99% docid=135887 []
3: 99% docid=73195 []
4: 99% docid=61144 []
5: 99% docid=61146 []
6: 99% docid=86053 []
7: 99% docid=8456 []
8: 99% docid=44840 []
9: 99% docid=155301 []
10: 99% docid=115241 []

It seems the Japanese word is sliced and recombined by the Xapian library if XAPIAN_CJK_NGRAM=1.
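The term list in the parsed query above (each character, followed by the 2-gram it starts) can be reproduced with a few lines of Python. This sketch only mimics the printed output for this one example; it is an assumption about the pattern, not Xapian's actual implementation, and ngram_terms is a made-up name.

```python
def ngram_terms(word):
    """Return each character followed by the 2-gram starting at it."""
    terms = []
    for i, ch in enumerate(word):
        terms.append(ch)                   # 1-gram
        if i + 1 < len(word):
            terms.append(word[i:i + 2])    # 2-gram with the next character
    return terms


print(" AND ".join(ngram_terms("サーバ")))
```

For サーバ this yields the same five terms, in the same order, as the parsed query shown above (without the @1 position markers).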

djcb commented 2 years ago

Thanks, that clarifies things. I've added some test cases for this; they do not pass yet, but at least it gives an automated way to test.

ychubachi commented 2 years ago

Thanks a lot!

djcb commented 1 year ago

Good news: mu 1.11.20 (and a little before) can now use Xapian's NGRAM support for this; see the new --support-ngrams option for mu init, and the test_ngrams unit test.