Closed fabiodl closed 1 year ago
It is a known issue in Xapian. As a workaround, I break the CJK strings in queries into bi-grams in my mu4e add-on project, mu4e-goodies.
thank you panjie, I will have a look at it
Can you provide an email message where this happens (with specifically what to search for, since unfortunately I do not read Japanese)? We could add it as a unit test.
Hi djcb! This is my case. I set XAPIAN_CJK_NGRAM=1. Let us assume that I have four mails whose subjects are as follows.
When I want to find 'サーバ', which means 'server' in Japanese, the correct answer should be mail 1. only.
Now I try:
a) mu find subject:サーバ -> no matches
b) mu find subject:サー -> matches 1. 2. and 3.
c) mu find subject:サ -> no matches
d) mu find subject:ーバ -> matches 1. and 4.
e) mu find subject:サー and subject:ーバ -> matches 1. <- BINGO!
So, to find a Japanese word of 3 or more characters, I must divide the word into 2-grams and then join them with 'and' operators, as follows.
mu find subject:あいうえお -> NG
mu find subject:あい and subject:いう and subject:うえ and subject:えお -> OK... but...
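The manual splitting described here can be automated. Below is a minimal sketch in Python, assuming each character of the search word is a single Unicode code point (no combining marks). The helper names `bigrams` and `bigram_query` are hypothetical and are not part of mu or mu4e-goodies; the output is only a query string to pass to mu find.

```python
def bigrams(text):
    """Return the overlapping 2-character slices of text."""
    if len(text) < 2:
        # A single character (or empty string) cannot be split further.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_query(field, text):
    """Build a query that ANDs one field:bigram term per 2-gram."""
    return " and ".join(f"{field}:{gram}" for gram in bigrams(text))


if __name__ == "__main__":
    # Prints the query string for the five-character example word.
    print(bigram_query("subject", "あいうえお"))
```

The resulting string could then be used as the search expression, for example mu find "$(python3 bigram_query.py)", where the script file name is also an assumption.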
The fundamental solution might be for Xapian to adopt a Japanese morphological analysis tool such as:
MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
Welcome to janome’s documentation! (English) — Janome v0.4 documentation (en) https://mocobeta.github.io/janome/en/
But if we had some support for dividing a Japanese word into 2-grams and connecting them with 'and' operators, we would be much happier!
What does notmuch use? Notmuch is able to deal with cjk
Hi fabiodl! Unfortunately, I have not used notmuch yet. I will check it later.
I sent myself an email and at least for me it works:
$ mu find サーバがダウンしました
2021-11-12T16:42:37 EET Yoshihide Chubachi <notifications@github.com> Re: [djcb/mu] cannot search japanese text longer than 2 chars (#1428)
2021-11-12T19:29:45 EET "Dirk-Jan C. Binnema" <djcb@djcbsoftware.nl> サーバがダウンしました
This is with Xapian 1.4.18 (and I'm not even setting XAPIAN_CJK_NGRAM)
I'm using:
$ echo $LANG
en_DK.utf8
@ychubachi : are you using a UTF-8 encoding?
Hi djcb!
I use UTF-8. But when you do not set XAPIAN_CJK_NGRAM, the situation becomes different.
Because Xapian does not know how to tokenize a Japanese sentence, it indexes the whole sentence, or pieces split at obvious delimiters such as '、' and '。'.
The expected tokenization is サーバ/が/ダウン/しまし/た. In that case, please try searching for only the word 'サーバ' or 'ダウン' ('ダウン' means 'down').
I used notmuch and found that Japanese search worked fine when XAPIAN_CJK_NGRAM=1.
I tested the effect of XAPIAN_CJK_NGRAM variable.
$ XAPIAN_CJK_NGRAM= notmuch search subject:サーバ | wc -l
0
$ XAPIAN_CJK_NGRAM=1 notmuch search subject:サーバ | wc -l
537
On the other hand, mu does not seem to be affected by the variable.
$ XAPIAN_CJK_NGRAM= mu find subject:サーバ
error: no matches for search expression
$ XAPIAN_CJK_NGRAM=1 mu find subject:サーバ
error: no matches for search expression
I also tried the simplesearch.rb script at https://xapian.org/docs/bindings/ruby
$ XAPIAN_CJK_NGRAM= ruby simplesearch.rb ~/Maildir/.notmuch/xapian/ サーバ
Parsed query is: Query(サーバ@1)
0 results found.
Matches 1-0:
0
$ XAPIAN_CJK_NGRAM=1 ruby simplesearch.rb ~/Maildir/.notmuch/xapian/ サーバ
Parsed query is: Query((サ@1 AND サー@1 AND ー@1 AND ーバ@1 AND バ@1))
200 results found.
Matches 1-10:
10
1: 100% docid=122280 []
2: 99% docid=135887 []
3: 99% docid=73195 []
4: 99% docid=61144 []
5: 99% docid=61146 []
6: 99% docid=86053 []
7: 99% docid=8456 []
8: 99% docid=44840 []
9: 99% docid=155301 []
10: 99% docid=115241 []
It seems the Japanese word is sliced and combined by xapian library if XAPIAN_CJK_NGRAM=1.
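To make the slicing concrete, here is a small Python sketch that reproduces the term list seen in the parsed query above (each character, followed by the 2-gram it starts). It only mimics that observed output; it is not Xapian's actual implementation, and the function name `cjk_ngram_terms` is hypothetical.

```python
def cjk_ngram_terms(text):
    """Emit each character, then the 2-gram starting at that character."""
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)  # unigram
        if i + 1 < len(text):
            terms.append(text[i:i + 2])  # bigram with the next character
    return terms


if __name__ == "__main__":
    # Join the terms the same way the parsed query shows them.
    print(" AND ".join(cjk_ngram_terms("サーバ")))
```

For 'サーバ' this yields the same five terms, in the same order, as the parsed query printed when XAPIAN_CJK_NGRAM=1.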
Thanks, that clarifies things. I've added some test cases for this; they do not pass yet, but at least they give an automated way to test.
Thanks a lot!
Good news: mu 1.11.20 (and a little before) can now use Xapian's NGRAM support for this; see the new --support-ngrams option for mu init, and the test_ngrams unit test.
When export XAPIAN_CJK_NGRAM=1 is set before indexing and searching (using yes in place of 1 made no observable difference), mu find cannot find Japanese text longer than 2 chars (mu: no matches for search expression (4)).
For instance, if "このデータを検索できるのかな" is present, "デー" can be found, but "データ" cannot.
Behavior confirmed for the following setup:
Ubuntu 18.04.1
mu version: 1.3.2
Xapian version: 1.4.11
A quick search online shows that the same happens on a completely different architecture, and probably on different versions: http://gcg00467.xii.jp/wp/archives/1749
On exactly the same maildir, the same shell (with the same XAPIAN_CJK_NGRAM=1) notmuch correctly indexes and retrieves mails where the query is longer than 2 chars.