apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Default KuromojiAnalyzer to use search mode [LUCENE-3726] #4800

Closed: asfimport closed this issue 12 years ago

asfimport commented 12 years ago

Kuromoji supports an option to segment text in a way more suitable for search, by preventing long compound nouns from becoming indexing terms.

In general, 'how you segment' can be important depending on the application (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this in Chinese).

The current algorithm penalizes the cost of long runs of kanji, based on some parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc.).
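Roughly, the idea is something like the following sketch (not the actual Kuromoji source; the constant values and the isAllKanji() helper here are illustrative only):

```java
// Sketch of the search-mode penalty: when a candidate word is added to the
// lattice, a long run of kanji gets its word cost inflated so the Viterbi
// search prefers a shorter segmentation. Values are illustrative only.
static final int SEARCH_MODE_LENGTH = 4;
static final int SEARCH_MODE_PENALTY = 3000;

static int searchModeCost(String surface, int baseWordCost) {
  if (surface.length() > SEARCH_MODE_LENGTH && isAllKanji(surface)) {
    // every character beyond the threshold makes the compound less attractive
    return baseWordCost + (surface.length() - SEARCH_MODE_LENGTH) * SEARCH_MODE_PENALTY;
  }
  return baseWordCost;
}

static boolean isAllKanji(String s) {
  for (int i = 0; i < s.length(); i++) {
    if (Character.UnicodeBlock.of(s.charAt(i)) != Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
      return false;
    }
  }
  return true;
}
```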

Some questions (these can be separate future issues if any useful ideas come out):

* should these parameters continue to be static-final, or configurable?
* should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
* is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?

Either way, I think as a start we should turn on what we have by default: it's likely a very easy win.


Migrated from LUCENE-3726 by Robert Muir (@rmuir), resolved Feb 05 2012. Attachments: kuromojieval.tar.gz, LUCENE-3726.patch (versions: 3)

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

These are very interesting questions, Robert. Please find my comments below.

> should these parameters continue to be static-final, or configurable?

It's perhaps possible to make these configurable, but I think we'd be exposing configuration that is more likely to confuse users than to help them.

The values currently used were found through some analysis and experimentation, and they can probably be improved, both in terms of tuning and with added heuristics – in particular for katakana compounds (more below).

However, changing and improving this requires quite detailed analysis and testing. I think the major case for exposing these parameters is as a means of tuning them easily, rather than them being generally useful to users.
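If we did expose them, I imagine it would be as a kind of "expert" constructor, something like this sketch (the parameter names here are made up and do not exist today; the values currently live in static-final constants):

```java
// Hypothetical expert constructor; today Segmenter is only configured with a
// Mode and the penalty parameters are static-final constants in the tokenizer.
public Segmenter(Mode mode, int searchModeLength, int searchModePenalty) {
  this.mode = mode;
  this.searchModeLength = searchModeLength;
  this.searchModePenalty = searchModePenalty;
}
```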

> should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?

Very good question and an interesting idea.

In the case of long kanji words such as 関西国際空港 (Kansai International Airport), which is a known noun, we can possibly use POS info as a hint for applying the Viterbi penalty. In the case of unknown kanji, Kuromoji unigrams them. (関西国際空港 becomes 関西 国際 空港 (Kansai International Airport) using search mode.)

Katakana compounds such as シニアソフトウェアエンジニア (senior software engineer) become one token without search mode, but when search mode is used, we get the three tokens シニア ソフトウェア エンジニア, as you would expect. It's also the case that シニアソフトウェアエンジニア is an unknown word, but its constituents become known words and get the correct POS after search mode.

In general, unknown words get a noun-POS (名詞) so the idea of using POS here should be fine.

There are some problems with the katakana decompounding in search mode. For example, コニカミノルタホールディングス (Konica Minolta Holdings) becomes コニカ ミノルタ ホール ディングス (Konica, Minolta, horu, dings), where we get the additional token ホール (which also means 'hall' in Japanese).

To sum up, I think we can potentially use the noun POS as a hint when doing the decompounding in search mode. I'm not sure how much we'll benefit from it, but I like the idea. I think we'll benefit most from an improved heuristic for non-kanji, to improve katakana decompounding.
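As a rough sketch of the kind of hint I mean (hypothetical; the real check would happen while building the lattice, and the POS handling here is simplified):

```java
// Hypothetical: only consider an entry for the search-mode penalty /
// decompounding when it is all kanji and its part-of-speech is a noun (名詞).
// Unknown words also get a noun POS, so they would still be covered.
// isAllKanji() is the same illustrative helper as in the earlier sketch.
static boolean shouldDecompound(String surface, String partOfSpeech) {
  return isAllKanji(surface) && partOfSpeech != null && partOfSpeech.startsWith("名詞");
}
```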

Let me have a tinker and see how I can improve this.

> is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?

Interesting idea and good point regarding IDF.

In order to do the decompounding, we'll need access to the lattice so we can add entries to it before we run the Viterbi. If we do normal segmentation first and then run a decompounding filter, I think we'll need to run the Viterbi twice to get the desired results. (Optimizations are possible, though.)

I'm thinking a possibility could be to expose possible decompounds as part of Kuromoji's Token interface. We can potentially have something like

```java
/**
 * Returns a list of possible decompounds for this token, found by a heuristic
 *
 * @return a list of candidate decompounds, or null if none are found
 */
public List<Token> getDecompounds() {
  // ...
}
```

In the case of シニアソフトウェアエンジニア, the current token would have surface form シニアソフトウェアエンジニア, but with tokens シニア, ソフトウェア and エンジニア accessible using getDecompounds().
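A caller could then do something like the following (a sketch only: it assumes Segmenter exposes a tokenize(String) method returning List<Token> and a surface-form accessor named getSurfaceFormString(), and getDecompounds() is of course just the proposed method above):

```java
// Sketch only: the Segmenter/Token method names are assumptions, and
// getDecompounds() is the proposed API from above.
Segmenter segmenter = new Segmenter();
for (Token token : segmenter.tokenize("シニアソフトウェアエンジニア")) {
  System.out.println(token.getSurfaceFormString());
  List<Token> decompounds = token.getDecompounds();
  if (decompounds != null) {
    for (Token part : decompounds) {
      System.out.println("  decompound: " + part.getSurfaceFormString());
    }
  }
}
```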

As a general note, I should point out that how well the heuristics perform depends on the dictionary/statistical model used (e.g. IPADIC), so we might want to make different heuristics for each of the dictionaries we support, as needed.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> I'm thinking a possibility could be to expose possible decompounds as part of Kuromoji's Token interface.

I like this idea: I think it would give the most flexibility. We would populate some attribute from Token, just like we do today for other attributes, and then the actual indexing of compounds can be controlled with a configurable tokenfilter.

Long term, this lets the tokenizer stay a tokenizer and prevents it from growing too complex.
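For the record, the kind of filter I have in mind would be something along these lines (very much a sketch: DecompoundAttribute does not exist, and a real patch would also need a matching AttributeImpl that the tokenizer populates from Token):

```java
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Attribute;

/** Hypothetical attribute the tokenizer would populate from Token#getDecompounds(). */
interface DecompoundAttribute extends Attribute {
  List<String> getDecompounds(); // null if the current token has no decompounds
}

/**
 * Sketch of a configurable decompounding filter: it keeps the original
 * compound token and injects its decompound parts at the same position,
 * so both forms get indexed.
 */
final class DecompoundingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final DecompoundAttribute decompoundAtt = addAttribute(DecompoundAttribute.class);
  private final LinkedList<String> pending = new LinkedList<String>();

  DecompoundingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // emit a buffered decompound part at the same position as the compound
      termAtt.setEmpty().append(pending.removeFirst());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    List<String> parts = decompoundAtt.getDecompounds();
    if (parts != null) {
      pending.addAll(parts); // queue the parts; the compound itself is returned first
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}
```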

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks for the feedback.

I'm working on tuning the heuristics to improve the accuracy of katakana segmentation in search mode.

I'll keep you posted on results and a patch. Unit tests will document the cases.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I've improved the heuristic and submitted a patch to #4804, which covers the issue.

We can now deal with cases such as コニカミノルタホールディングス and many others just fine. The former becomes コニカ ミノルタ ホールディングス as we'd like.

I think we should apply #4804 before changing any defaults – and also independently of changing any defaults. I think we should also make sure that the default we use for Lucene is consistent with Solr's default in schema.xml for text_ja.

I'll do additional tests on a Japanese corpus and provide feedback, and we can use this as a basis for how to follow up. Hopefully, we'll have sufficient and good data to conclude on this.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I've segmented some Japanese Wikipedia text into sentences (using a naive sentence segmenter) and then segmented each sentence using both normal and search mode with the Kuromoji on GitHub that has #4804 applied. Segmentation with Kuromoji in Lucene should be similar overall (modulo some differences in punctuation handling).

Search mode and normal mode segmentation match completely in 90.7% of the sentences segmented and there's a 99.6% match at the token level (when counting normal mode tokens).

Find attached some HTML files with a total of 10,000 sentences that demonstrate the differences in segmentation.

Overall, I think search mode does a decent job. I've written to someone else doing Japanese NLP to get a second opinion, in particular on whether the kanji splitting should be made somewhat less eager to split three-letter words.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

The latest attached patch introduces a default mode in Segmenter, which is now Mode.SEARCH.

This mode is used by KuromojiAnalyzer in Lucene without further code changes. The Solr factory previously duplicated the default mode, but now retrieves it from Segmenter. This way, we set the default mode for both Solr and Lucene in a single place (Segmenter), which I find cleaner.
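Concretely, the idea is roughly the following (paraphrasing; the actual names in the patch may differ slightly):

```java
// In Segmenter: the default segmentation mode is defined in exactly one place.
public static final Mode DEFAULT_MODE = Mode.SEARCH;

// KuromojiAnalyzer uses a Segmenter built with that default, and the Solr
// factory falls back to it instead of hard-coding its own copy. The "mode"
// argument name and args map here stand in for the factory's configuration.
String modeArg = args.get("mode"); // may be null if schema.xml does not set it
Mode mode = modeArg != null ? Mode.valueOf(modeArg.toUpperCase(Locale.ENGLISH))
                            : Segmenter.DEFAULT_MODE;
```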

I've also moved some constructors around in Segmenter and did some minor formatting/style changes.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Christian: I committed this.