apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

How Nori Tokenizer can deal with Longest-Matching [LUCENE-8631] #9677

Closed: asfimport closed this issue 5 years ago

asfimport commented 5 years ago

I think the Nori tokenizer has an issue.

I don't understand why longest matching is not applied by the Nori tokenizer in config mode (config mode: https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/analysis-nori-tokenizer.html).

 

Here is an example to explain what longest matching means.

Assume userdict_ko.txt contains only three Korean words, '골드', '브라운', and '골드브라운', and we register it with the Nori analyzer. After the update, the input '골드브라운' is tokenized into two tokens, '골드' and '브라운'. (In English: '골드' means 'gold', '브라운' means 'brown', and '골드브라운' means 'goldbrown'.)
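For reference, "config mode" here means wiring the dictionary file into the tokenizer settings, roughly as in the linked 6.x documentation (the index and analyzer names below are made up for illustration):

```json
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "골드브라운"
}
```

With this setup, analyzing '골드브라운' produces the two tokens described above rather than the single compound token.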

 

This result shows that longest matching is not applied. If it were, the output would be the single token '골드브라운', the longest word in the user dictionary that matches the input.

 

Curiously enough, when we add the user dictionary via custom mode (custom mode: https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc), the result is '골드브라운', i.e. longest matching is applied. We think this is because the trained MeCab engine automatically generates word costs by its own criteria. We hope this mechanism can also be applied in config mode.

 

Could you tell me how to get longest matching in config mode (not custom mode), or give me some hints (e.g. where to modify the source code) to solve this problem?

 

P.S.

Recently, I mailed @jimczi, the developer of Nori, and received these suggestions:

   - Add a way to set a score for each new rule (this way you could set up a negative cost for the compound word that is less than the sum of the two single words).

   - Same as above, but the cost is computed from the statistics of the training (as the custom dictionary does when you recompile it entirely).

   - Implement longest-match first in the dictionary.
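The cost-based intuition behind the first two suggestions can be sketched with a toy Viterbi-style segmenter. The words and costs below are invented for illustration, and connection costs are ignored; this is not Nori's actual lattice implementation:

```python
def segment(text, costs):
    """Return (best_cost, tokens) for the cheapest segmentation of text,
    considering only per-word costs (no connection costs)."""
    INF = float("inf")
    n = len(text)
    best = [(INF, [])] * (n + 1)  # best[i] = cheapest way to cover text[:i]
    best[0] = (0, [])
    for i in range(n):
        if best[i][0] == INF:
            continue  # position i is unreachable
        for word, cost in costs.items():
            if text.startswith(word, i):
                cand = (best[i][0] + cost, best[i][1] + [word])
                if cand[0] < best[i + len(word)][0]:
                    best[i + len(word)] = cand
    return best[n]

# If the compound costs at least as much as the sum of its parts, the split wins:
print(segment("골드브라운", {"골드": 100, "브라운": 100, "골드브라운": 250}))
# -> (200, ['골드', '브라운'])

# Giving the compound a cost below that sum makes the longest word win:
print(segment("골드브라운", {"골드": 100, "브라운": 100, "골드브라운": 150}))
# -> (150, ['골드브라운'])
```

Suggestion 1 would expose such a per-rule cost to the user; suggestion 2 would derive it from training statistics instead.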

 

Thanks for your support.


Migrated from LUCENE-8631 by Yeongsu Kim (@gritmind), resolved Mar 12 2019

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Thanks for reporting @gritmind. Since we give the same cost to all user words I think the easiest way to solve this issue would be to implement a longest-only match in the user dictionary. We don't check the main dictionary when we have matches in the user dictionary so this should only ensure that the longest rule that matches wins. This should also speed up the tokenization since we'd add a single path in the lattice (instead of all user words that match). I'll work on a patch.
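The longest-only match described above can be sketched as follows. This is a hypothetical helper, not the actual Lucene patch, which lives in the Korean user-dictionary lookup:

```python
def longest_user_match(text, pos, user_words):
    """Return the longest user-dictionary word starting at pos, or None.
    Shorter matches at the same position are discarded, so only a single
    user-word path would be added to the lattice."""
    longest = None
    for word in user_words:
        if text.startswith(word, pos) and (longest is None or len(word) > len(longest)):
            longest = word
    return longest

user_words = {"골드", "브라운", "골드브라운"}
print(longest_user_match("골드브라운", 0, user_words))  # 골드브라운
```

Because only one candidate survives per position, the lattice no longer contains the '골드' + '브라운' path at all, which is both the fix and the speedup mentioned above.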

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit b1f870a4164769df62b24af63048aa2f9b21af47 in lucene-solr's branch refs/heads/master from Yeongsu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b1f870a

LUCENE-8631: The Korean user dictionary now picks the longest-matching word and discards the other matches.

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 8d0652451ea4ed9d0285fb5f8c7568c058c6730b in lucene-solr's branch refs/heads/branch_8x from Yeongsu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8d06524

LUCENE-8631: The Korean user dictionary now picks the longest-matching word and discards the other matches.

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Thanks @gritmind!

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release