KoreanTokenizer should split unknown words on digits [LUCENE-8966]

asfimport commented 5 years ago

Since #9594 the Korean tokenizer groups characters of unknown words if they belong to the same script or an inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work well on digits since they are considered common with other scripts. For instance the input "44사이즈" is kept as is even though "사이즈" is part of the dictionary. We should restore the original behavior and splits any unknown words if a digit is followed by another type.

This issue was first discovered in https://github.com/elastic/elasticsearch/issues/46365

Migrated from LUCENE-8966 by Jim Ferenczi (@jimczi), resolved Sep 13 2019 Attachments: LUCENE-8966.patch (versions: 2)

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Here is a patch that breaks unknown words on digits instead of grouping them with other types.

asfimport commented 5 years ago

Lucene/Solr QA (migrated from JIRA)

✔ +1 overall

Vote	Subsystem	Runtime	Comment
Prechecks
+1	test4tests	0m 0s	The patch appears to include 1 new or modified test files.
master Compile Tests
+1	compile	0m 40s	master passed
Patch Compile Tests
+1	compile	0m 37s	the patch passed
+1	javac	0m 36s	the patch passed
+1	Release audit (RAT)	0m 36s	the patch passed
+1	Check forbidden APIs	0m 36s	the patch passed
+1	Validate source patterns	0m 36s	the patch passed
Other Tests
+1	unit	0m 43s	nori in the patch passed.
	4m 35s

Subsystem	Report/Notes
JIRA Issue	LUCENE-8966
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12979538/LUCENE-8966.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 78b6530fb26
ant	version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019
Default Java	LTS
Test Results	https://builds.apache.org/job/PreCommit-LUCENE-Build/205/testReport/
modules	C: lucene/analysis/nori U: lucene/analysis/nori
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/205/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 5 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

isNumber() is dead code?

asfimport commented 5 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Also you are using a private isDigit() at one place, the other place uses Character.isDigit(). This should be consistent.

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Thanks for looking @uschindler. These two private static functions are dead codes that I forgot to remove. The other places use Character.isDigit() consistently.

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

New patch without the dead code

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Would you consider grouping numbers and (at least some) punctuation together so that we can preserve decimals and fractions?

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Would you consider grouping numbers and (at least some) punctuation together so that we can preserve decimals and fractions?

For complex number grouping and normalization, @danmuzi added a KoreanNumberFilter in #9856

It is identical to the JapaneseNumberFilter excepts that it only detects Korean hangul numbers. I don't think it handles fractions though but this could be added if needed.

asfimport commented 5 years ago

Lucene/Solr QA (migrated from JIRA)

✔ +1 overall

Vote	Subsystem	Runtime	Comment
Prechecks
+1	test4tests	0m 0s	The patch appears to include 1 new or modified test files.
master Compile Tests
+1	compile	0m 35s	master passed
Patch Compile Tests
+1	compile	0m 35s	the patch passed
+1	javac	0m 35s	the patch passed
+1	Release audit (RAT)	0m 35s	the patch passed
+1	Check forbidden APIs	0m 35s	the patch passed
+1	Validate source patterns	0m 35s	the patch passed
Other Tests
+1	unit	0m 38s	nori in the patch passed.
	4m 6s

Subsystem	Report/Notes
JIRA Issue	LUCENE-8966
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12979562/LUCENE-8966.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 78b6530fb26
ant	version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019
Default Java	LTS
Test Results	https://builds.apache.org/job/PreCommit-LUCENE-Build/206/testReport/
modules	C: lucene/analysis/nori U: lucene/analysis/nori
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/206/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> For complex number grouping and normalization, Namgyu Kim added a KoreanNumberFilter in #9856

Ah thanks, I'll have a look

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Good job! @jimczi :D

It can be serious enough for Nori users.

About Punctuation, as @jimczi said, it can be remained by using discardPunctuation(set false) parameter in KoreanTokenizer.

You can test it by using analyzerWithPunctuation instance in TestKoreanTokenizer.

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

But there is a bug I just checked :(

Input : "4......4사이즈" Expected : [4] [......] [4] [사이즈] Actual : [4] [.] [.....] [4] [사이즈]

// Need to pass!
public void testDuplicatePunctuation() throws IOException {
  assertAnalyzesTo(analyzerWithPunctuation, "4......4사이즈",
      new String[]{"4", "......", "4", "사이즈"},
      new int[]{0, 1, 7, 8},
      new int[]{1, 7, 8, 11},
      new int[]{1, 1, 1, 1}
  );
}

I think we need to fix it. If it is okay to fix within this JIRA issue, I'll post additional patch. Otherwise I'll create a new one.

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I don't think it's a bug @danmuzi or at least that it's related to this issue. In your example the first dot ('.' is a word dictionary) is considered a better path than grouping all dots eagerly. We process the unknown words greedily so we compare the path "[4], [.], [.....]" with "[4], [.], [.], [....]", "[4], [.], [.], [.], [...]", ... "[4], [......]". Keeping the first dot separated from the rest indicates that a number followed by a dot is a better splitting path than multiple dots in our model. We can discuss this behavior in a new issue if you think this should be configurable (for instance the JapaneseTokenizer process unknown words greedily only in search mode) ?

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Oh, Thank you for your reply. @jimczi :D

I checked again and it was not bug. That result is come from viterbi path.

But I think it needs to be discussed. So I added a new issue about it.

I'd appreciate if you check #10020.

P.S. +1 to your patch

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit c8f36238ab0d99e2dc60b944030f90832b850737 in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c8f3623

LUCENE-8966: The Korean analyzer split tokens on boundaries between digits and alphabetic characters.

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit c4815f04c06256dbc9b28afdeb8b9c689198fd7d in lucene-solr's branch refs/heads/branch_8x from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4815f0

LUCENE-8966: The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters.

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit ec1ef2bce652686416a6b1d9f57da7805fb1f93c in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ec1ef2b

LUCENE-8966: update CHANGES.txt after backport

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release

apache / lucene

KoreanTokenizer should split unknown words on digits [LUCENE-8966] #10009