Closed asfimport closed 5 years ago
Jim Ferenczi (@jimczi) (migrated from JIRA)
Here is a patch that breaks unknown words on digits instead of grouping them with other types.
Lucene/Solr QA (migrated from JIRA)
✔ +1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
master Compile Tests | |||
+1 | compile | 0m 40s | master passed |
Patch Compile Tests | |||
+1 | compile | 0m 37s | the patch passed |
+1 | javac | 0m 36s | the patch passed |
+1 | Release audit (RAT) | 0m 36s | the patch passed |
+1 | Check forbidden APIs | 0m 36s | the patch passed |
+1 | Validate source patterns | 0m 36s | the patch passed |
Other Tests | |||
+1 | unit | 0m 43s | nori in the patch passed. |
4m 35s |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8966 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12979538/LUCENE-8966.patch |
Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / 78b6530fb26 |
ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
Default Java | LTS |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/205/testReport/ |
modules | C: lucene/analysis/nori U: lucene/analysis/nori |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/205/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
Uwe Schindler (@uschindler) (migrated from JIRA)
isNumber() is dead code?
Uwe Schindler (@uschindler) (migrated from JIRA)
Also you are using a private isDigit() at one place, the other place uses Character.isDigit(). This should be consistent.
Jim Ferenczi (@jimczi) (migrated from JIRA)
Thanks for looking @uschindler. These two private static functions are dead codes that I forgot to remove. The other places use Character.isDigit() consistently.
Jim Ferenczi (@jimczi) (migrated from JIRA)
New patch without the dead code
Michael Sokolov (@msokolov) (migrated from JIRA)
Would you consider grouping numbers and (at least some) punctuation together so that we can preserve decimals and fractions?
Jim Ferenczi (@jimczi) (migrated from JIRA)
Would you consider grouping numbers and (at least some) punctuation together so that we can preserve decimals and fractions?
For complex number grouping and normalization, @danmuzi added a KoreanNumberFilter in #9856
It is identical to the JapaneseNumberFilter excepts that it only detects Korean hangul numbers. I don't think it handles fractions though but this could be added if needed.
Lucene/Solr QA (migrated from JIRA)
✔ +1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
master Compile Tests | |||
+1 | compile | 0m 35s | master passed |
Patch Compile Tests | |||
+1 | compile | 0m 35s | the patch passed |
+1 | javac | 0m 35s | the patch passed |
+1 | Release audit (RAT) | 0m 35s | the patch passed |
+1 | Check forbidden APIs | 0m 35s | the patch passed |
+1 | Validate source patterns | 0m 35s | the patch passed |
Other Tests | |||
+1 | unit | 0m 38s | nori in the patch passed. |
4m 6s |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8966 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12979562/LUCENE-8966.patch |
Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / 78b6530fb26 |
ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
Default Java | LTS |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/206/testReport/ |
modules | C: lucene/analysis/nori U: lucene/analysis/nori |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/206/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
Michael Sokolov (@msokolov) (migrated from JIRA)
> For complex number grouping and normalization, Namgyu Kim added a KoreanNumberFilter in #9856
Ah thanks, I'll have a look
Namgyu Kim (@danmuzi) (migrated from JIRA)
Good job! @jimczi :D
It can be serious enough for Nori users.
About Punctuation, as @jimczi said, it can be remained by using discardPunctuation(set false) parameter in KoreanTokenizer.
You can test it by using analyzerWithPunctuation instance in TestKoreanTokenizer.
Namgyu Kim (@danmuzi) (migrated from JIRA)
But there is a bug I just checked :(
Input : "4......4사이즈" Expected : [4] [......] [4] [사이즈] Actual : [4] [.] [.....] [4] [사이즈]
// Need to pass!
public void testDuplicatePunctuation() throws IOException {
assertAnalyzesTo(analyzerWithPunctuation, "4......4사이즈",
new String[]{"4", "......", "4", "사이즈"},
new int[]{0, 1, 7, 8},
new int[]{1, 7, 8, 11},
new int[]{1, 1, 1, 1}
);
}
I think we need to fix it. If it is okay to fix within this JIRA issue, I'll post additional patch. Otherwise I'll create a new one.
Jim Ferenczi (@jimczi) (migrated from JIRA)
I don't think it's a bug @danmuzi or at least that it's related to this issue. In your example the first dot ('.' is a word dictionary) is considered a better path than grouping all dots eagerly. We process the unknown words greedily so we compare the path "[4], [.], [.....]" with "[4], [.], [.], [....]", "[4], [.], [.], [.], [...]", ... "[4], [......]". Keeping the first dot separated from the rest indicates that a number followed by a dot is a better splitting path than multiple dots in our model. We can discuss this behavior in a new issue if you think this should be configurable (for instance the JapaneseTokenizer process unknown words greedily only in search mode) ?
Namgyu Kim (@danmuzi) (migrated from JIRA)
Oh, Thank you for your reply. @jimczi :D
I checked again and it was not bug. That result is come from viterbi path.
But I think it needs to be discussed. So I added a new issue about it.
I'd appreciate if you check #10020.
P.S. +1 to your patch
ASF subversion and git services (migrated from JIRA)
Commit c8f36238ab0d99e2dc60b944030f90832b850737 in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c8f3623
LUCENE-8966: The Korean analyzer split tokens on boundaries between digits and alphabetic characters.
ASF subversion and git services (migrated from JIRA)
Commit c4815f04c06256dbc9b28afdeb8b9c689198fd7d in lucene-solr's branch refs/heads/branch_8x from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4815f0
LUCENE-8966: The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters.
ASF subversion and git services (migrated from JIRA)
Commit ec1ef2bce652686416a6b1d9f57da7805fb1f93c in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ec1ef2b
LUCENE-8966: update CHANGES.txt after backport
Adrien Grand (@jpountz) (migrated from JIRA)
Closing after the 9.0.0 release
Since #9594 the Korean tokenizer groups characters of unknown words if they belong to the same script or an inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work well on digits since they are considered common with other scripts. For instance the input "44사이즈" is kept as is even though "사이즈" is part of the dictionary. We should restore the original behavior and splits any unknown words if a digit is followed by another type.
This issue was first discovered in https://github.com/elastic/elasticsearch/issues/46365
Migrated from LUCENE-8966 by Jim Ferenczi (@jimczi), resolved Sep 13 2019 Attachments: LUCENE-8966.patch (versions: 2)