Tomoko Uchida (@mocobeta) (migrated from JIRA)
Thanks Uihyun Kim, this all looks fine to me. I will commit it soon.
lucene $ ./gradlew -p lucene/analysis/nori/ regenerate
BUILD SUCCESSFUL in 18s
lucene $ ./gradlew -p lucene/analysis/nori/ test
BUILD SUCCESSFUL in 3s
ASF subversion and git services (migrated from JIRA)
Commit 76c9fd4e38af65d28236d5a2695348fbd8fa3ed8 in lucene's branch refs/heads/main from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene.git;h=76c9fd4
LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
Tomoko Uchida (@mocobeta) (migrated from JIRA)
There was no CHANGES entry in your patch, so I added one for 9.1.0.
ASF subversion and git services (migrated from JIRA)
Commit b2b35964663bfbf2063884d7dcda6818d5b590e1 in lucene's branch refs/heads/branch_9x from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene.git;h=b2b3596
LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
Uwe Schindler (@uschindler) (migrated from JIRA)
I have one question: if you have indexed text using the Korean analyzer - do you need to reindex or is it "mostly fine"?
The problem is if tokens are generated with different rules or normalization, you won't find them in index anymore.
In older Lucene versions we had a "matchVersion" parameter for this, but using it here would require shipping both dictionaries.
If there are significant changes we should ship this only in 10.0, not with version 9.1.
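To make the concern concrete, here is a minimal toy sketch in Python. It is not Lucene or Nori code: the two tokenizers, the `NEW_DICTIONARY_SPLITS` rule, and the sample documents are all hypothetical, chosen only to show how a term indexed under one set of rules can become unfindable once queries are analyzed under another.

```python
from collections import defaultdict

# Hypothetical decompounding rule that only the "new" dictionary knows about.
NEW_DICTIONARY_SPLITS = {"searchengine": ["search", "engine"]}


def tokenize_old(text):
    """Toy stand-in for the old dictionary: whitespace tokens only."""
    return text.lower().split()


def tokenize_new(text):
    """Toy stand-in for the new dictionary: also splits known compounds."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(NEW_DICTIONARY_SPLITS.get(word, [word]))
    return tokens


def build_index(docs, tokenize):
    """Build a tiny inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents containing every token of the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "fast searchengine", 2: "slow database"}

# Index written with the old rules, queried with the new rules.
old_index = build_index(docs, tokenize_old)
stale_hits = search(old_index, "searchengine", tokenize_new)

# Same query after reindexing with the new rules.
new_index = build_index(docs, tokenize_new)
fresh_hits = search(new_index, "searchengine", tokenize_new)

# A query whose tokens did not change still works against the old index.
unaffected_hits = search(old_index, "database", tokenize_new)
```

In this toy setup the old index loses the match for the changed term until it is rebuilt, while unchanged terms keep working, which is the "mostly fine" versus "needs reindex" distinction asked about above.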
Tomoko Uchida (@mocobeta) (migrated from JIRA)
Right. I haven't considered it.
I can't estimate the impact of the change - I don't understand Korean.
Uihyun Kim I'd like to hear your thoughts on it. Should we ship this change with 9.1, or with 10.0? If full reindexing is recommended to adopt this change we will have to delay it until 10.0.
ASF subversion and git services (migrated from JIRA)
Commit c22d6d09d9b9b9d44fd88e886ed3105c5a927a63 in lucene's branch refs/heads/branch_9x from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene.git;h=c22d6d0
Revert "LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori"
This reverts commit b2b35964663bfbf2063884d7dcda6818d5b590e1.
ASF subversion and git services (migrated from JIRA)
Commit f8040d565fc25c6b7388d9300c2cc890315bc9cd in lucene's branch refs/heads/main from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene.git;h=f8040d5
LUCENE-10416: move changes entry to v10.0.0
Tomoko Uchida (@mocobeta) (migrated from JIRA)
I'd revert it from the 9x branch since I can't estimate the impact. It'd be easy to backport this again to 9x. Let me know if you'd like to have this in 9.1.
Uihyun Kim (@uihyun) (migrated from JIRA)
@mocobeta @uschindler Thank you for reviewing this. The update will give more precise output from the Korean tokenizer. The overall output should be the same, since the word costs used for the FST aren't changed. But to adopt this change fully without any potential issues, full reindexing or new indexing would be recommended. It seems reasonable to treat it as a major change.
Uwe Schindler (@uschindler) (migrated from JIRA)
Thank you for confirming. We applied the change only for version 10.0. The upcoming version 9.1 will not have the new dictionary.
For Nori (the Korean analyzer), there is a Korean dictionary named mecab-ko-dic, which is available under an Apache license here: https://bitbucket.org/eunjeon/mecab-ko-dic
The copy of the dictionary used in Nori hasn't been updated, although upstream has released updates that provide better analysis results. Downloads are available here: https://bitbucket.org/eunjeon/mecab-ko-dic/downloads
There are changes between the currently used version and the latest release version (change log: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/CHANGES.md).
With the updated dictionary, running :lucene:analysis:nori:test and building a new binary both work without issues.
Migrated from LUCENE-10416 by Uihyun Kim (@uihyun), 1 vote, resolved Feb 20 2022 Attachments: LUCENE-10416.patch