apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

CJKAnalyzer not matching multibyte character followed by non-multibyte character [LUCENE-2673] #3747

Open asfimport opened 14 years ago

asfimport commented 14 years ago

Here is a listing of text indexed in a field, followed by various search terms that did or did not match the document.

[QES様文字化けテスト]
QES -> retrievable
QES様 -> not retrievable
QES様文字化けテスト -> retrievable

[SOA基盤]
SOA -> retrievable
SOA基 -> not retrievable
SOA基盤 -> retrievable

[日経BP]
日経 -> retrievable
日経B -> not retrievable
日経BP -> retrievable


Migrated from LUCENE-2673 by Kevin Hayen, 1 vote, updated May 16 2011

asfimport commented 14 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

I think CJKAnalyzer works as expected.

QES様 -> not retrievable
SOA基 -> not retrievable

Because CJK characters are tokenized into 2-grams, "様" and "基" on their own are not tokens in the index.

日経B -> not retrievable

Because non-CJK characters are tokenized at whitespace, "B" on its own is not a token in the index.
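
To make the explanation above concrete, here is a minimal Python sketch of this kind of tokenization. It is a simplified model, not Lucene's actual CJKAnalyzer code: the function names `cjk_bigram_tokens` and `matches` are hypothetical, any non-ASCII letter is assumed to be CJK, a lone CJK character is assumed to be emitted as a single-character token, and a query is assumed to match only if its tokens appear as a contiguous run in the indexed tokens.

```python
def cjk_bigram_tokens(text):
    """Simplified model (assumption, not Lucene's code): ASCII alphanumeric
    runs become one lowercased token; runs of other letters (treated as CJK)
    become overlapping bigrams, or a single token if the run has length 1."""
    tokens = []
    run = ""
    run_kind = None  # "latin", "cjk", or None when outside any run

    def flush():
        nonlocal run, run_kind
        if run_kind == "latin":
            tokens.append(run.lower())
        elif run_kind == "cjk":
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        run, run_kind = "", None

    for ch in text:
        if ch.isascii() and ch.isalnum():
            kind = "latin"
        elif ch.isalpha():
            kind = "cjk"  # crude assumption: every non-ASCII letter is CJK
        else:
            kind = None   # whitespace and punctuation end the current run
        if kind != run_kind:
            flush()
        if kind is not None:
            run += ch
            run_kind = kind
    flush()
    return tokens


def matches(indexed_text, query_text):
    """Hypothetical phrase-style match: the query tokens must occur as a
    contiguous run inside the indexed tokens."""
    doc = cjk_bigram_tokens(indexed_text)
    query = cjk_bigram_tokens(query_text)
    n = len(query)
    return n > 0 and any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


print(cjk_bigram_tokens("日経BP"))   # ['日経', 'bp']
print(cjk_bigram_tokens("日経B"))    # ['日経', 'b']
print(matches("日経BP", "日経B"))    # False: 'b' was never indexed
print(matches("SOA基盤", "SOA基"))   # False: '基' alone was never indexed
print(matches("SOA基盤", "SOA"))     # True
```

Under these assumptions the sketch reproduces the listing in the report: the query "QES様" yields the tokens "qes" and "様", but the index for "QES様文字化けテスト" contains "様文" rather than "様", so the query finds nothing.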

asfimport commented 14 years ago

Kevin Hayen (migrated from JIRA)

That is the current behavior. However, after checking with our Japanese office, I have confirmed that it is common for Western and Asian characters to be placed side by side, so the current behavior does not match what users will expect.