Closed sssslang closed 7 years ago
Can you give an example of such a CJK search? Thanks.
(I mean, a sample message that doesn't work well for you.) Thanks!
Long time without comment... closing this.
Hello Dirk,
Thank you for the hard work on mu & mu4e. I sincerely ask you to re-open this issue, and I will give more details below.
I'm a new user from China. I find that the exact same problem reported by @sssslang still exists, which makes mu almost unusable for me, since most of my emails are in Chinese.
In the following, I demonstrate the problem with two small test emails:
test_en.eml:
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: Test email

An test email in English.
test_cn.eml:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: =?utf-8?b?5rWL6K+V6YKu5Lu277ybVGVzdCBlbWFpbA==?=

5Lit5paH5rWL6K+V6YKu5Lu244CCClRlc3QgZW1haWwgaW4gQ2hpbmVzZS4=
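For readers who want to check the encoded test message by hand, its subject and body can be decoded with Python's standard library. This is only a convenience sketch: the function and variable names are illustrative and have nothing to do with how mu itself parses mail.

```python
import base64
from email.header import decode_header

# Raw values copied from test_cn.eml.
subject_raw = "=?utf-8?b?5rWL6K+V6YKu5Lu277ybVGVzdCBlbWFpbA==?="
body_raw = "5Lit5paH5rWL6K+V6YKu5Lu244CCClRlc3QgZW1haWwgaW4gQ2hpbmVzZS4="


def decode_subject(raw):
    """Decode an RFC 2047 encoded header value into a plain string."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


subject = decode_subject(subject_raw)
# The body uses Content-Transfer-Encoding: base64 with charset utf-8.
body = base64.b64decode(body_raw).decode("utf-8")

print(subject)
print(body)
```

The decoded subject and the two body lines should agree with what mu view prints for test_cn.eml.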
> mu view ./maildir/cur/test_en.eml
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: Test email
An test email in English.
> mu view ./maildir/cur/test_cn.eml
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: 测试邮件;Test email
中文测试邮件。
Test email in Chinese.
> mu find --muhome=./mu ""
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> Test email
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
> mu find --muhome=./mu "中文"
mu: no matches for search expression (4)
> mu find --muhome=./mu "测试"
mu: no matches for search expression (4)
> mu find --muhome=./mu "s:测试"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
> mu find --muhome=./mu "Chinese"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
> notmuch --config=./notmuch-config search --output=summary "*"
thread:0000000000000001 1970-01-01 [1/1] Sender; Test email (new)
thread:0000000000000002 1970-01-01 [1/1] Sender; 测试邮件;Test email (new)
> notmuch --config=./notmuch-config search --output=summary "subject:测试"
thread:0000000000000002 1970-01-01 [1/1] Sender; 测试邮件;Test email (new)
> notmuch --config=./notmuch-config search --output=summary "中文"
thread:0000000000000002 1970-01-01 [1/1] Sender; 测试邮件;Test email (new)
Note: the English word Chinese and the Chinese word 中文 only exist in the body of the Chinese test email.
I would be glad to provide any further information to help solve this problem.
Best regards!
Sorry, I forgot the mu version information.
I'm using the development version, which I pulled from this GitHub repository on 2016-01-28. I just pulled the latest version, but all the updates are mu4e-related.
> mu --version
mu (mail indexer/searcher) version 0.9.17
Copyright (C) 2008-2015 Dirk-Jan C. Binnema
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Regards!
@liweitianux: ah, thanks! Esp. the example emails are useful.
djcb@borealis:Sources/mu <4>:% XAPIAN_CJK_NGRAM=yes mu find "测试"
mu: no matches for search expression (4)
djcb@borealis:Sources/mu <4>:% mu find "测试"
1970-01-01T02:00:00 EET Sender <sender@example.com> 测试邮件;Test email
Interestingly, it seems to work when XAPIAN_CJK_NGRAM is not set when querying (it was set during indexing). When XAPIAN_CJK_NGRAM is set, the Xapian query-parser seems to see the characters as separate search terms.
@djcb This is quite interesting, and I can confirm your finding. However, it is strange and annoying that mu/Xapian behaves differently when searching Chinese/CJK characters with or without the query prefix (e.g., subject:).
I set the environment variable XAPIAN_CJK_NGRAM=1 in my shell (zsh) configuration, and then indexed my emails.
[1] > env | grep CJK
XAPIAN_CJK_NGRAM=1
I can confirm that the Chinese/CJK search works correctly for mu when the environment variable XAPIAN_CJK_NGRAM is not set or empty (I also tested with my big mail archive, and it works):
[2] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu chinese 中文
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[3] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu 中文 OR english
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> Test email
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
Unfortunately, for proper Chinese/CJK queries, the environment variable XAPIAN_CJK_NGRAM should be set; otherwise the Chinese/CJK query string is NOT segmented. Xapian then searches the database with the whole supplied Chinese/CJK query string as is, and returns wrong or no results. (The Xapian database was built with XAPIAN_CJK_NGRAM set, so the Chinese/CJK strings/sentences are properly segmented.) For example:
[4] > env XAPIAN_CJK_NGRAM=yes mu find --muhome=./mu subject:测试邮件
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[5] > env XAPIAN_CJK_NGRAM=yes mu find --muhome=./mu subject:测试 subject:邮件
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[6] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu subject:测试邮件
mu: no matches for search expression (4)
[7] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu subject:测试 subject:邮件
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[8] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu 中文测试
mu: no matches for search expression (4)
[9] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu 中文 测试
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
Please note the two cases ([6] and [8]) in which mu does not find the Chinese test email when XAPIAN_CJK_NGRAM is not set and the Chinese query string is not segmented. If I manually segment the Chinese query string, mu gives me the correct search results (e.g., [7] and [9]).
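The mismatch between a segmented index and an unsegmented query can be sketched in a few lines of Python. This is a simplified model, not Xapian's real tokenizer: it assumes a CJK run is indexed as its single characters plus overlapping two-character pairs, and it ignores positions, prefixes and non-CJK text. The names cjk_ngrams and matches are made up for this illustration.

```python
def cjk_ngrams(text):
    """Return the unigrams and overlapping bigrams of text, in reading order."""
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)
        if i + 1 < len(text):
            terms.append(text[i:i + 2])
    return terms


# Terms assumed to be stored for the CJK part of the subject at index time.
indexed = set(cjk_ngrams("测试邮件"))


def matches(query_words, segment):
    """AND all query words together; optionally segment each word first."""
    terms = []
    for word in query_words:
        terms.extend(cjk_ngrams(word) if segment else [word])
    return all(term in indexed for term in terms)


print(matches(["测试邮件"], segment=True))       # True,  like case [4]
print(matches(["测试邮件"], segment=False))      # False, like case [6]
print(matches(["测试", "邮件"], segment=False))  # True,  like case [7]
```

In this model the four-character string is never stored as a single term, so an unsegmented query for it cannot match, while the two-character words happen to coincide with stored pairs.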
Cheers!
Hmm, that's weird. So, indexing should be done with XAPIAN_CJK_NGRAM=yes.
However, after that, some queries only work with XAPIAN_CJK_NGRAM empty, and some others only with XAPIAN_CJK_NGRAM non-empty.
I have to think a bit about this...
Dear @djcb,
I recently investigated mu and xapian further, and found that this issue is due to the wrong combination behavior applied to the tokenized CJK query terms with respect to the prefixes. I just reported this bug to the Xapian community (Ticket 719). I'm looking forward to it being fixed.
Meanwhile, I also provide some more details here:
[1] > mu find --muhome=mu --format=xquery b:中文
Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
[2] > env XAPIAN_CJK_NGRAM="" mu find --muhome=mu --format=xquery b:中文
Xapian::Query(B中文:(pos=1))
[3] > mu find --muhome=mu --format=xquery 中文
Xapian::Query((H中:(pos=1) AND B中:(pos=1) AND C中:(pos=1) AND E中:(pos=1) AND \
J中:(pos=1) AND F中:(pos=1) AND M中:(pos=1) AND Y中:(pos=1) AND \
I中:(pos=1) AND S中:(pos=1) AND T中:(pos=1) AND U中:(pos=1) AND \
X中:(pos=1) AND D中:(pos=1) AND G中:(pos=1) AND P中:(pos=1) AND \
Z中:(pos=1) AND V中:(pos=1) AND W中:(pos=1) AND \
H中文:(pos=1) AND B中文:(pos=1) AND C中文:(pos=1) AND \
E中文:(pos=1) AND J中文:(pos=1) AND F中文:(pos=1) AND \
M中文:(pos=1) AND Y中文:(pos=1) AND I中文:(pos=1) AND \
S中文:(pos=1) AND T中文:(pos=1) AND U中文:(pos=1) AND \
X中文:(pos=1) AND D中文:(pos=1) AND G中文:(pos=1) AND \
P中文:(pos=1) AND Z中文:(pos=1) AND V中文:(pos=1) AND \
W中文:(pos=1) AND \
H文:(pos=1) AND B文:(pos=1) AND C文:(pos=1) AND E文:(pos=1) AND \
J文:(pos=1) AND F文:(pos=1) AND M文:(pos=1) AND Y文:(pos=1) AND \
I文:(pos=1) AND S文:(pos=1) AND T文:(pos=1) AND U文:(pos=1) AND \
X文:(pos=1) AND D文:(pos=1) AND G文:(pos=1) AND P文:(pos=1) AND \
Z文:(pos=1) AND V文:(pos=1) AND W文:(pos=1)))
[4] > env XAPIAN_CJK_NGRAM="" mu find --muhome=mu --format=xquery 中文
Xapian::Query((H中文:(pos=1) OR B中文:(pos=1) OR C中文:(pos=1) OR \
E中文:(pos=1) OR J中文:(pos=1) OR F中文:(pos=1) OR \
M中文:(pos=1) OR Y中文:(pos=1) OR I中文:(pos=1) OR \
S中文:(pos=1) OR T中文:(pos=1) OR U中文:(pos=1) OR \
X中文:(pos=1) OR D中文:(pos=1) OR G中文:(pos=1) OR \
P中文:(pos=1) OR Z中文:(pos=1) OR V中文:(pos=1) OR \
W中文:(pos=1)))
As we can see, with XAPIAN_CJK_NGRAM set and without a query prefix, the same tokenized CJK term (e.g., 中) is wrongly AND-combined across the prefixes; it should instead be OR-combined.
On the other hand, without XAPIAN_CJK_NGRAM set, the CJK tokenization does not happen at all, so the query term is correctly OR-combined across all the prefixes.
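To make the difference concrete, here is a toy evaluation of the two query shapes in Python. It is a sketch under stated assumptions: only two prefixes are modelled instead of the nineteen in the real output, S is assumed to hold subject terms and B body terms, and the n-gram function is a simplified stand-in for Xapian's tokenizer. All names are hypothetical.

```python
def cjk_ngrams(text):
    """Return the unigrams and overlapping bigrams of text, in reading order."""
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)
        if i + 1 < len(text):
            terms.append(text[i:i + 2])
    return terms


# Toy index for the Chinese test email: prefix -> set of stored terms.
doc = {
    "S": set(cjk_ngrams("测试邮件")),      # subject (assumed prefix meaning)
    "B": set(cjk_ngrams("中文测试邮件")),  # body
}
PREFIXES = ["S", "B"]


def flat_and(word):
    """Shape of query [3]: every n-gram must occur under every prefix."""
    return all(term in doc[p] for p in PREFIXES for term in cjk_ngrams(word))


def unsegmented_or(word):
    """Shape of query [4]: the whole word may occur under any prefix."""
    return any(word in doc[p] for p in PREFIXES)


print(flat_and("中文"))        # False: the subject does not contain 中
print(unsegmented_or("中文"))  # True: the body contains the pair 中文
```

Because 中文 appears only in the body, the flat AND can never be satisfied, while the unsegmented OR matches because the two-character word coincides with a stored pair.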
However, I don't know how notmuch handles this CJK query correctly; as I reported previously, it does not have this problem.
[5] > env NOTMUCH_DEBUG_QUERY=1 NOTMUCH_CONFIG="./notmuch-config" notmuch search "中文"
Query string is:
中文
Exclude query is:
Xapian::Query((Kdeleted OR Kspam))
Final query is:
Xapian::Query(((Tmail AND 中:(pos=1) AND 中文:(pos=1) AND 文:(pos=1)) AND_NOT \
(Kdeleted OR Kspam)))
Query string is:
thread:0000000000000002
Exclude query is:
Xapian::Query()
Final query is:
Xapian::Query((Tmail AND 0 * G0000000000000002))
thread:0000000000000002 1970-01-01 [1/1] Sender; 测试邮件;Test email (new)
Best regards! Aly
Oh, good catch! The xapian result was quite mysterious...
mu does a bit of massaging of the input data as well as the queries, which makes them a bit different from notmuch I suppose.
Hi @djcb ,
Xapian issue 719 has been fixed, and the fix was released in Xapian v1.4.1 and v1.2.25.
I tried building a recent mu with Xapian v1.4.1 and re-ran my previous tests. I think this issue is fixed and can be closed.
The new test results follow:
[1] > mu find --muhome=mu "中文"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[2] > mu find --muhome=mu "测试"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[3] > mu find --muhome=mu "s:测试"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[4] > mu find --muhome=mu --format=xquery "b:中文"
Query((B中@1 AND B中文@1 AND B文@1))
[5] > mu find --muhome=mu --format=xquery "中文"
Query(((H中@1 AND H中文@1 AND H文@1) OR
(B中@1 AND B中文@1 AND B文@1) OR
(C中@1 AND C中文@1 AND C文@1) OR
(E中@1 AND E中文@1 AND E文@1) OR
(J中@1 AND J中文@1 AND J文@1) OR
(F中@1 AND F中文@1 AND F文@1) OR
(M中@1 AND M中文@1 AND M文@1) OR
(Y中@1 AND Y中文@1 AND Y文@1) OR
(I中@1 AND I中文@1 AND I文@1) OR
(S中@1 AND S中文@1 AND S文@1) OR
(T中@1 AND T中文@1 AND T文@1) OR
(U中@1 AND U中文@1 AND U文@1) OR
(X中@1 AND X中文@1 AND X文@1) OR
(D中@1 AND D中文@1 AND D文@1) OR
(G中@1 AND G中文@1 AND G文@1) OR
(P中@1 AND P中文@1 AND P文@1) OR
(Z中@1 AND Z中文@1 AND Z文@1) OR
(V中@1 AND V中文@1 AND V文@1) OR
(W中@1 AND W中文@1 AND W文@1)))
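The shape of query [5], an OR over prefixes of per-prefix AND groups, can be checked against a toy index in Python. As before this is only a sketch: two prefixes stand in for the full list, S is assumed to mean subject and B body, and cjk_ngrams is a simplified approximation of the tokenizer rather than Xapian's code.

```python
def cjk_ngrams(text):
    """Return the unigrams and overlapping bigrams of text, in reading order."""
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)
        if i + 1 < len(text):
            terms.append(text[i:i + 2])
    return terms


# Toy index for the Chinese test email: prefix -> set of stored terms.
doc = {
    "S": set(cjk_ngrams("测试邮件")),      # subject (assumed prefix meaning)
    "B": set(cjk_ngrams("中文测试邮件")),  # body
}
PREFIXES = ["S", "B"]


def grouped_match(word):
    """All n-grams must occur under one prefix; any single prefix suffices."""
    terms = cjk_ngrams(word)
    return any(all(term in doc[p] for term in terms) for p in PREFIXES)


print(grouped_match("中文"))  # True: satisfied by the body group alone
print(grouped_match("测试"))  # True: satisfied by subject or body
print(grouped_match("英文"))  # False: no field contains these terms
```

With this grouping a word that occurs in a single field is enough to match, which is consistent with results [1] and [2] above.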
Thanks to you and the Xapian community!
@liweitianux: oh, that's nice to hear, thanks! Closing...
After enabling CJK support using an environment variable, I've found it inconvenient to search CJK text. When searching for a word in English, I can get all matches without specifying which field to search, e.g. mu find foo instead of mu find s:foo OR b:foo. But this doesn't work when searching CJK text: I need to tell mu explicitly which field to search, like mu find s:CJK_WORD. I'm not familiar with Xapian, and don't know where the problem is.