djcb / mu

maildir indexer/searcher + emacs mail client + guile bindings
http://www.djcbsoftware.nl/code/mu
GNU General Public License v3.0

behavior different when searching CJK text #123

Closed sssslang closed 7 years ago

sssslang commented 11 years ago

After enabling CJK support using an environment variable, I've found that searching CJK text is inconvenient. When searching for an English word, I get all matches without specifying which field to search, e.g. mu find foo instead of mu find s:foo OR b:foo. But this doesn't work when searching CJK text: I need to tell mu explicitly which field to search, like mu find s:CJK_WORD. I'm not familiar with Xapian, so I don't know where the problem is.
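As a stopgap, the explicit per-field form described above can be generated mechanically. This is only a minimal Python sketch: the helper name explicit_field_query is hypothetical, and it assumes just the s: and b: field shortcuts that appear in this report.

```python
def explicit_field_query(word, fields=("s", "b")):
    """Build an explicit per-field query string such as 's:foo OR b:foo'.

    The default field shortcuts (s: and b:) are the two used in this
    report; which other shortcuts to include is left to the caller.
    """
    return " OR ".join(f"{field}:{word}" for field in fields)


# The resulting string would be passed to `mu find` as the search expression.
query = explicit_field_query("CJK_WORD")
print(query)
```

The output string has the same shape as the s:foo OR b:foo example in the report.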

djcb commented 11 years ago

Can you give an example of such a CJK search? Thanks.

djcb commented 11 years ago

(I mean, a sample message that doesn't work well for you). Thanks!

djcb commented 11 years ago

Long time without comment... closing this.

liweitianux commented 8 years ago

Hello Dirk,

Thank you for the hard work on mu & mu4e. I sincerely ask you to re-open this issue, and I will give more details below.

I'm a new user from China. I find that the exact same problem reported by @sssslang still exists, which makes mu almost unusable for me, since most of my emails are in Chinese.

Below, I demonstrate the problem with two small test emails:

1. Test emails

Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: Test email

An test email in English.

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: =?utf-8?b?5rWL6K+V6YKu5Lu277ybVGVzdCBlbWFpbA==?=

5Lit5paH5rWL6K+V6YKu5Lu244CCClRlc3QgZW1haWwgaW4gQ2hpbmVzZS4=

2. Contents of the test emails

> mu view ./maildir/cur/test_en.eml
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: Test email
An test email in English.
> mu view ./maildir/cur/test_cn.eml
From: Sender <sender@example.com>
To: Recipient <recipient@example.com>
Subject: 测试邮件;Test email
中文测试邮件。
Test email in Chinese.
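The decoded contents above can be cross-checked without mu. The following is a minimal Python sketch that uses only the standard library; the two encoded strings are copied from the Chinese test email in step 1, and the variable names are illustrative.

```python
import base64
from email.header import decode_header, make_header

# Encoded Subject header and base64 body, copied from the Chinese test email.
raw_subject = "=?utf-8?b?5rWL6K+V6YKu5Lu277ybVGVzdCBlbWFpbA==?="
raw_body = "5Lit5paH5rWL6K+V6YKu5Lu244CCClRlc3QgZW1haWwgaW4gQ2hpbmVzZS4="

# RFC 2047 encoded-word -> Unicode string.
subject = str(make_header(decode_header(raw_subject)))

# Content-Transfer-Encoding: base64 with charset utf-8 -> Unicode string.
body = base64.b64decode(raw_body).decode("utf-8")

print(subject)
print(body)
```

The printed subject and body should agree with what mu view shows for test_cn.eml.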

3. The mu index database

> mu find --muhome=./mu ""
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> Test email
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

4. Search problem with Chinese characters/words

> mu find --muhome=./mu "中文"   
mu: no matches for search expression (4)
> mu find --muhome=./mu "测试"   
mu: no matches for search expression (4)
> mu find --muhome=./mu "s:测试" 
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
> mu find --muhome=./mu "Chinese"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

5. notmuch does NOT have this Chinese search problem

> notmuch --config=./notmuch-config search --output=summary "*"
thread:0000000000000001   1970-01-01 [1/1] Sender; Test email (new)
thread:0000000000000002   1970-01-01 [1/1] Sender; 测试邮件;Test email (new)
> notmuch --config=./notmuch-config search --output=summary "subject:测试"
thread:0000000000000002   1970-01-01 [1/1] Sender; 测试邮件;Test email (new)
> notmuch --config=./notmuch-config search --output=summary "中文"        
thread:0000000000000002   1970-01-01 [1/1] Sender; 测试邮件;Test email (new)

Note: the English word Chinese and the Chinese word 中文 exist only in the body of the Chinese test email.

I'd be glad to provide any further information that helps solve this problem.

Best regards!

liweitianux commented 8 years ago

Sorry, I forgot to include the mu version information.

I'm using the development version, which I pulled from this GitHub repository on 2016-01-28. I just pulled the latest version as well, but all the updates are mu4e-related.

> mu --version
mu (mail indexer/searcher) version 0.9.17
Copyright (C) 2008-2015 Dirk-Jan C. Binnema
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Regards!

djcb commented 8 years ago

@liweitianux: ah, thanks! Esp. the example emails are useful.

djcb commented 8 years ago

djcb@borealis:Sources/mu <4>:% XAPIAN_CJK_NGRAM=yes mu find "测试"
mu: no matches for search expression (4)
djcb@borealis:Sources/mu <4>:% mu find "测试"
1970-01-01T02:00:00 EET Sender <sender@example.com> 测试邮件;Test email              

Interestingly, it seems to work when XAPIAN_CJK_NGRAM is not set when querying (it was set during indexing). When XAPIAN_CJK_NGRAM is set, the Xapian query-parser seems to see the characters as separate search terms.

liweitianux commented 8 years ago

@djcb This is quite interesting, and I can confirm your finding. However, it is strange and annoying that mu/Xapian behaves differently when searching Chinese/CJK characters with or without a query prefix (e.g., subject:).

I set the environment variable XAPIAN_CJK_NGRAM=1 in my shell (zsh) configuration, and then indexed my emails.

[1] > env | grep CJK
XAPIAN_CJK_NGRAM=1

I can confirm that the Chinese/CJK search works correctly for mu when the environment variable XAPIAN_CJK_NGRAM is not set or empty (I also tested with my big mail archive, and it works):

[2] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu chinese 中文
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[3] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu 中文 OR english
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> Test email
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

Unfortunately, a proper Chinese/CJK query needs the environment variable XAPIAN_CJK_NGRAM to be set; otherwise the Chinese/CJK query string is NOT segmented. Xapian then searches the database with the whole Chinese/CJK query string as is, and returns wrong or no results. (The Xapian database was built with XAPIAN_CJK_NGRAM set, so the Chinese/CJK strings/sentences in it are properly segmented.) For example:

[4] > env XAPIAN_CJK_NGRAM=yes mu find --muhome=./mu subject:测试邮件
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[5] > env XAPIAN_CJK_NGRAM=yes mu find --muhome=./mu subject:测试 subject:邮件
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[6] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu subject:测试邮件
mu: no matches for search expression (4)
[7] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu subject:测试 subject:邮件
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email
[8] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu 中文测试
mu: no matches for search expression (4)
[9] > env XAPIAN_CJK_NGRAM="" mu find --muhome=./mu 中文 测试
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

Please note the two cases ([6] and [8]) in which mu does not find the Chinese test email: XAPIAN_CJK_NGRAM is not set and the Chinese query string is not segmented. If I segment the Chinese query string manually, mu gives me the correct search results (e.g., [7] and [9]).
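The segmentation effect in [6] and [7] can be modelled roughly as follows. This is a simplified Python sketch, not Xapian's actual tokenizer: it assumes that n-gram mode emits every single character plus every pair of adjacent characters, which is the pattern visible in the --format=xquery dumps in this thread, and it looks only at the CJK part of the subject. The function name cjk_ngrams is hypothetical.

```python
def cjk_ngrams(text):
    """Return unigrams and bigrams for a run of CJK characters.

    Assumed model: each character, followed by the bigram it starts
    (if any), in left-to-right order.
    """
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)
        if i + 1 < len(text):
            terms.append(text[i:i + 2])
    return terms


# Terms an n-gram index would hold for the CJK part of the subject.
indexed = set(cjk_ngrams("测试邮件"))

# Case [6]: the unsegmented query string is looked up as one term.
print("测试邮件" in indexed)

# Case [7]: the manually segmented pieces are both indexed bigrams.
print({"测试", "邮件"} <= indexed)
```

Under this model the four-character string is never an indexed term, while each two-character piece is, which is consistent with [6] finding nothing and [7] finding the message.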

Cheers!

djcb commented 8 years ago

Hmm, that's weird. So, indexing should be done with XAPIAN_CJK_NGRAM=yes.

However, after that, some queries only work with XAPIAN_CJK_NGRAM empty, and some others only with XAPIAN_CJK_NGRAM non-empty.

I have to think a bit about this...

liweitianux commented 8 years ago

Dear @djcb,

I recently investigated mu and Xapian further, and found that this issue is caused by the tokenized CJK query terms being combined incorrectly across the prefixes. I have just reported this bug to the Xapian community (Ticket 719), and I'm looking forward to it being fixed.

Meanwhile, here are some more details:

[1] > mu find --muhome=mu --format=xquery b:中文  
Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))

[2] > env XAPIAN_CJK_NGRAM="" mu find --muhome=mu --format=xquery b:中文
Xapian::Query(B中文:(pos=1))

[3] > mu find --muhome=mu --format=xquery 中文
Xapian::Query((H中:(pos=1) AND B中:(pos=1) AND C中:(pos=1) AND E中:(pos=1) AND \
               J中:(pos=1) AND F中:(pos=1) AND M中:(pos=1) AND Y中:(pos=1) AND \
               I中:(pos=1) AND S中:(pos=1) AND T中:(pos=1) AND U中:(pos=1) AND \
               X中:(pos=1) AND D中:(pos=1) AND G中:(pos=1) AND P中:(pos=1) AND \
               Z中:(pos=1) AND V中:(pos=1) AND W中:(pos=1) AND \
               H中文:(pos=1) AND B中文:(pos=1) AND C中文:(pos=1) AND \
               E中文:(pos=1) AND J中文:(pos=1) AND F中文:(pos=1) AND \
               M中文:(pos=1) AND Y中文:(pos=1) AND I中文:(pos=1) AND \
               S中文:(pos=1) AND T中文:(pos=1) AND U中文:(pos=1) AND \
               X中文:(pos=1) AND D中文:(pos=1) AND G中文:(pos=1) AND \
               P中文:(pos=1) AND Z中文:(pos=1) AND V中文:(pos=1) AND \
               W中文:(pos=1) AND \
               H文:(pos=1) AND B文:(pos=1) AND C文:(pos=1) AND E文:(pos=1) AND \
               J文:(pos=1) AND F文:(pos=1) AND M文:(pos=1) AND Y文:(pos=1) AND \
               I文:(pos=1) AND S文:(pos=1) AND T文:(pos=1) AND U文:(pos=1) AND \
               X文:(pos=1) AND D文:(pos=1) AND G文:(pos=1) AND P文:(pos=1) AND \
               Z文:(pos=1) AND V文:(pos=1) AND W文:(pos=1)))

[4] > env XAPIAN_CJK_NGRAM="" mu find --muhome=mu --format=xquery 中文  
Xapian::Query((H中文:(pos=1) OR B中文:(pos=1) OR C中文:(pos=1) OR \
               E中文:(pos=1) OR J中文:(pos=1) OR F中文:(pos=1) OR \
               M中文:(pos=1) OR Y中文:(pos=1) OR I中文:(pos=1) OR \
               S中文:(pos=1) OR T中文:(pos=1) OR U中文:(pos=1) OR \
               X中文:(pos=1) OR D中文:(pos=1) OR G中文:(pos=1) OR \
               P中文:(pos=1) OR Z中文:(pos=1) OR V中文:(pos=1) OR \
               W中文:(pos=1)))

As we can see, with XAPIAN_CJK_NGRAM set and without a query prefix, the same tokenized CJK term (e.g., 中) is wrongly AND-combined across all the prefixes, when it should be OR-combined. On the other hand, without XAPIAN_CJK_NGRAM set, the CJK tokenization does not happen at all, so the query term is correctly OR-combined across all the prefixes.

However, I don't know how notmuch handles CJK queries correctly; it does not have this problem, as I reported previously.
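The difference between the two combinations can be illustrated with a toy model. This Python sketch is not Xapian code: it treats a document as a plain set of prefixed terms, uses only two of the prefixes from dump [3] (S and B), and the function names are hypothetical.

```python
PREFIXES = ["S", "B"]          # two of the prefixes seen in dump [3]
NGRAMS = ["中", "中文", "文"]   # tokenized form of the query 中文

# Toy document: 中文 occurs only under the B prefix, as in the test email.
doc_terms = {"B" + term for term in NGRAMS}


def matches_and_everywhere(doc):
    """Shape of dump [3]: every n-gram must occur under every prefix."""
    return all(p + t in doc for t in NGRAMS for p in PREFIXES)


def matches_or_across_prefixes(doc):
    """Expected shape: all n-grams under at least one prefix."""
    return any(all(p + t in doc for t in NGRAMS) for p in PREFIXES)


print(matches_and_everywhere(doc_terms))
print(matches_or_across_prefixes(doc_terms))
```

In this model the first form can only match a message that contains the word under every prefix at once, which would explain the "no matches" results, while the second form matches as soon as one field contains it.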

[5] > env NOTMUCH_DEBUG_QUERY=1 NOTMUCH_CONFIG="./notmuch-config" notmuch search "中文"
Query string is:
中文
Exclude query is:
Xapian::Query((Kdeleted OR Kspam))
Final query is:
Xapian::Query(((Tmail AND 中:(pos=1) AND 中文:(pos=1) AND 文:(pos=1)) AND_NOT \
               (Kdeleted OR Kspam)))
Query string is:
thread:0000000000000002
Exclude query is:
Xapian::Query()
Final query is:
Xapian::Query((Tmail AND 0 * G0000000000000002))
thread:0000000000000002   1970-01-01 [1/1] Sender; 测试邮件;Test email (new)

Best regards! Aly

djcb commented 8 years ago

Oh, good catch! The xapian result was quite mysterious...

mu does a bit of massaging of the input data as well as the queries, which makes them a bit different from notmuch I suppose.

liweitianux commented 7 years ago

Hi @djcb ,

Xapian issue 719 has been fixed, and the fix was released in Xapian v1.4.1 and v1.2.25.
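For anyone checking whether an installed Xapian should include this fix, the two release numbers above can be turned into a small version test. This is a hedged Python sketch: the function name is hypothetical, it expects a plain dotted version string, and it assumes that the 1.3.x development releases, which came before 1.4.0, do not contain the fix.

```python
def has_cjk_prefix_fix(version):
    """Return True if a Xapian version string is at least 1.4.1,
    or is a 1.2.x release at or after 1.2.25.

    Assumption: 1.3.x development releases are treated as unfixed.
    """
    parts = tuple(int(p) for p in version.split("."))
    if parts >= (1, 4, 1):
        return True
    return (1, 2, 25) <= parts < (1, 3, 0)


for v in ("1.2.24", "1.2.25", "1.4.0", "1.4.1"):
    print(v, has_cjk_prefix_fix(v))
```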

I built a recent mu with Xapian v1.4.1 and re-ran my previous tests. I think this issue is fixed and can be closed.

Here are the new test results:

[1] > mu find --muhome=mu "中文"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

[2] > mu find --muhome=mu "测试"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

[3] > mu find --muhome=mu "s:测试"
Thu 01 Jan 1970 08:00:00 AM CST Sender <sender@example.com> 测试邮件;Test email

[4] > mu find --muhome=mu --format=xquery "b:中文"
Query((B中@1 AND B中文@1 AND B文@1))

[5] > mu find --muhome=mu --format=xquery "中文"
Query(((H中@1 AND H中文@1 AND H文@1) OR
       (B中@1 AND B中文@1 AND B文@1) OR
       (C中@1 AND C中文@1 AND C文@1) OR
       (E中@1 AND E中文@1 AND E文@1) OR
       (J中@1 AND J中文@1 AND J文@1) OR
       (F中@1 AND F中文@1 AND F文@1) OR
       (M中@1 AND M中文@1 AND M文@1) OR
       (Y中@1 AND Y中文@1 AND Y文@1) OR
       (I中@1 AND I中文@1 AND I文@1) OR
       (S中@1 AND S中文@1 AND S文@1) OR
       (T中@1 AND T中文@1 AND T文@1) OR
       (U中@1 AND U中文@1 AND U文@1) OR
       (X中@1 AND X中文@1 AND X文@1) OR
       (D中@1 AND D中文@1 AND D文@1) OR
       (G中@1 AND G中文@1 AND G文@1) OR
       (P中@1 AND P中文@1 AND P文@1) OR
       (Z中@1 AND Z中文@1 AND Z文@1) OR
       (V中@1 AND V中文@1 AND V文@1) OR
       (W中@1 AND W中文@1 AND W文@1)))

Thanks to you and the Xapian community!

djcb commented 7 years ago

@liweitianux: oh, that's nice to hear, thanks! Closing...