
Add decompose compound Japanese Katakana token capability to Kuromoji [LUCENE-3921] #4994

Open asfimport opened 12 years ago

asfimport commented 12 years ago

The Japanese morphological analyzer Kuromoji doesn't have the capability to decompose every Japanese Katakana compound token into sub-tokens. It seems that some Katakana tokens can be decomposed, but this cannot be applied to every Katakana compound token. For instance, "トートバッグ (tote bag)" and "ショルダーバッグ (shoulder bag)" don't decompose into "トート バッグ" and "ショルダー バッグ", although the IPA dictionary has "バッグ" in its entries. I would like the decompose feature to apply to every Katakana token whose sub-tokens are in the dictionary, or to add the capability to force the decompose feature onto every Katakana token.
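
For reference, the reported behaviour can be checked with a small driver around JapaneseTokenizer in search mode. This is only a sketch against a recent lucene-analyzers-kuromoji API (Lucene 5+ constructors and setReader, not the 3.x line this issue was filed against):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KatakanaCompoundCheck {
  public static void main(String[] args) throws Exception {
    // No user dictionary, discard punctuation, search mode (as in the reported environment).
    JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, Mode.SEARCH);
    tokenizer.setReader(new StringReader("トートバッグ ショルダーバッグ"));

    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Per the report, the compounds come out as single tokens rather than トート バッグ etc.
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```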


Migrated from LUCENE-3921 by Kazuaki Hiraga (@hkazuakey), updated Oct 07 2012 Environment:

CentOS 5, IPA Dictionary, run with "Search mode"
asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello, Kazu. Long time no see – I hope things are well!

This is a very good feature request. I think this might be possible by changing how we emit unknown words, i.e. by not emitting them as greedily and giving the lattice more segmentation options. For example, if we find an unknown word トートバッグ (by regular greedy matching), we can emit

ト
トー
トート
トートバ
トートバッ
トートバッグ

in the current position. When we reach the position that starts with バッグ we'll find a known word. When the Viterbi runs, it's likely to choose トート and バッグ as its best path.
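
To make the idea concrete, here is a toy, self-contained sketch of that lattice trick: at each position, unknown-word candidates of every length (up to a cap) are emitted alongside dictionary words, and a Viterbi-style dynamic program picks the cheapest segmentation. The costs, the known-word set, and the length cap below are invented for illustration; Kuromoji's real lattice uses per-word costs and connection costs from the dictionary.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Set;

public class UnknownKatakanaLatticeDemo {

  static final Set<String> KNOWN = Set.of("バッグ"); // toy dictionary for the demo
  static final int KNOWN_COST = 2;                   // hypothetical flat cost for a dictionary word
  static final int MAX_UNKNOWN_LENGTH = 6;           // cap on emitted unknown-word candidates

  static int unknownCost(int length) {
    return 5 + 3 * length;                           // hypothetical penalty that grows with length
  }

  static void segment(String text) {
    int n = text.length();
    int[] best = new int[n + 1];
    int[] back = new int[n + 1];
    Arrays.fill(best, Integer.MAX_VALUE);
    best[0] = 0;

    // At each end position, consider every candidate token ending there:
    // either a known dictionary word or an unknown word up to MAX_UNKNOWN_LENGTH.
    for (int end = 1; end <= n; end++) {
      for (int start = Math.max(0, end - MAX_UNKNOWN_LENGTH); start < end; start++) {
        if (best[start] == Integer.MAX_VALUE) continue;
        String candidate = text.substring(start, end);
        int cost = KNOWN.contains(candidate) ? KNOWN_COST : unknownCost(candidate.length());
        if (best[start] + cost < best[end]) {
          best[end] = best[start] + cost;
          back[end] = start;
        }
      }
    }

    // Recover the best path by walking the backpointers.
    Deque<String> tokens = new ArrayDeque<>();
    for (int end = n; end > 0; end = back[end]) {
      tokens.addFirst(text.substring(back[end], end));
    }
    System.out.println(text + " -> " + String.join(" ", tokens));
  }

  public static void main(String[] args) {
    segment("トートバッグ");     // expected: トート バッグ
    segment("ショルダーバッグ"); // expected: ショルダー バッグ
  }
}
```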

Let me have a play by looking into the lattice details and see if something like this is feasible. We are sort of hacking the model here so we also need to consider side-effects.

asfimport commented 12 years ago

Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)

Hello, Christian. It's been a long time!

We really want to have that capability. As you may know, it's hard to deal with tokens that consist of two or three Katakana words. We want a good way to solve the issue more systematically, rather than building a hand-made dictionary.

Looking forward to hearing from you.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I've been experimenting with the idea outlined above and I thought I should share some very early results.

The change here basically gives the compound splitting heuristic an improved ability to split unknown words that are part of compounds. Experiments I've run using our compound splitting test cases suggest that the effect is indeed positive. The improved heuristic is able to handle some of the test cases that we couldn't handle earlier, but all of this requires further experimentation and validation.

I've been able to segment トートバッグ (tote bag with トート being unknown) and also ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it also segmented エンジニアリング (engineering) into エンジニア (engineer) リング (ring).

It might be possible to tune this up or develop a more advanced heuristic that remedies this, but I haven't had a chance to look further into it. Also, any change here would require extensive testing and validation. See the evaluation attached to #4800 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting katakana words that start with ン、ッ、ー, since we don't want tokens that start with these characters, and consider adding this as an option to the tokenizer if it works well.

Having said this, there are real limits to what we can achieve by hacking the statistical model (and it also affects our karma, you know...). The approach above also has a performance and memory impact. We'd need to introduce a fairly short limit on how long unknown words can be, and this could perhaps apply only to unknown katakana words. The length restriction would be big enough not to have any practical impact on segmentation, though.
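
Taken together, those two emission constraints (no candidates starting with ン, ッ or ー, and a cap on unknown-word length) could look something like the standalone check below. The cap value is hypothetical, not anything Kuromoji defines today.

```java
public class UnknownKatakanaCandidateCheck {

  private static final String DISALLOWED_FIRST_CHARS = "ンッー";
  private static final int MAX_UNKNOWN_KATAKANA_LENGTH = 8; // hypothetical limit

  /** Returns true if the candidate should be emitted into the lattice as an unknown katakana word. */
  static boolean shouldEmit(String candidate) {
    if (candidate.isEmpty() || candidate.length() > MAX_UNKNOWN_KATAKANA_LENGTH) {
      return false;
    }
    // Skip candidates that start with a syllabic n, a small tsu, or a prolonged sound mark.
    return DISALLOWED_FIRST_CHARS.indexOf(candidate.charAt(0)) < 0;
  }

  public static void main(String[] args) {
    System.out.println(shouldEmit("トート"));     // true
    System.out.println(shouldEmit("ッグ"));       // false: starts with ッ
    System.out.println(shouldEmit("ートバッグ")); // false: starts with ー
  }
}
```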

An alternative approach to all of this is to build some lexical assets. I think we'd get pretty far for katakana if we apply some of the corpus-based compound-splitting algorithms European NLP researchers have developed. Some of these algorithms are pretty simple and quite effective.
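
One of the simpler corpus-based approaches (in the spirit of the geometric-mean splitter that has been used for German compounds) scores each candidate split by the geometric mean of its parts' corpus frequencies and keeps the split only if it beats the unsplit word. A rough sketch, with made-up frequencies:

```java
import java.util.List;
import java.util.Map;

public class FrequencyCompoundSplitter {

  private final Map<String, Integer> corpusFreq;
  private final int minPartLength;

  FrequencyCompoundSplitter(Map<String, Integer> corpusFreq, int minPartLength) {
    this.corpusFreq = corpusFreq;
    this.minPartLength = minPartLength;
  }

  /** Returns the best-scoring two-part split, or the word itself if no split scores higher. */
  List<String> split(String word) {
    double bestScore = corpusFreq.getOrDefault(word, 0);
    List<String> best = List.of(word);
    for (int i = minPartLength; i <= word.length() - minPartLength; i++) {
      String left = word.substring(0, i);
      String right = word.substring(i);
      int freqLeft = corpusFreq.getOrDefault(left, 0);
      int freqRight = corpusFreq.getOrDefault(right, 0);
      double score = Math.sqrt((double) freqLeft * freqRight); // geometric mean of part frequencies
      if (freqLeft > 0 && freqRight > 0 && score > bestScore) {
        bestScore = score;
        best = List.of(left, right);
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Map<String, Integer> freq = Map.of("トート", 120, "バッグ", 4500, "トートバッグ", 80);
    FrequencyCompoundSplitter splitter = new FrequencyCompoundSplitter(freq, 2);
    System.out.println(splitter.split("トートバッグ")); // [トート, バッグ], since sqrt(120 * 4500) > 80
  }
}
```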

Thoughts?

asfimport commented 12 years ago

Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)

Christian, thank you for your comments and for giving us details about your experiments.

I think the length restriction is an acceptable option for splitting Katakana compound tokens. I tried DictionaryCompoundWordTokenFilter for this purpose and we were able to get a result similar to what we expected. But we don't want to rely on another dictionary alongside the dictionaries for the Japanese tokenizer. So, if possible, we would like to have such functionality in Kuromoji.
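
For reference, a minimal sketch of the DictionaryCompoundWordTokenFilter approach mentioned above, assuming a recent Lucene analysis-common API (package names and constructors have moved between major versions). A WhitespaceTokenizer is used only to keep the example self-contained; in a real pipeline the filter would sit after the Japanese tokenizer, and the subword set here is a stand-in for katakana entries extracted from the IPA dictionary:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet; // older versions: org.apache.lucene.analysis.util.CharArraySet
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KatakanaDictionaryCompoundDemo {
  public static void main(String[] args) throws Exception {
    // Stand-in subword list; in practice this would come from the IPA dictionary's katakana entries.
    CharArraySet subwords = new CharArraySet(Arrays.asList("トート", "ショルダー", "バッグ"), false);

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("トートバッグ ショルダーバッグ"));

    // minWordSize=3, minSubwordSize=2, maxSubwordSize=15, onlyLongestMatch=true
    TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, subwords, 3, 2, 15, true);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // Emits the original compound plus the dictionary subwords found inside it.
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}
```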

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

I have discovered a similar problem with the Smart Chinese toolkit. Would the same approach work for both languages? Would it be worth solving this problem with a generic tool rather than a language-specific one?

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Lance,

The idea I had in mind for Japanese uses language-specific characteristics of katakana terms and perhaps weights that are dictionary-specific as well. However, we are hacking our statistical model here, and there are limitations as to how far we can go with this.

I don't know a whole lot about the Smart Chinese toolkit, but I believe the same approach to compound segmentation could work for Chinese as well. However, the weights and implementation would likely be separate. Note that the above is really about one specific kind of compound segmentation that applies to Japanese, so the thinking was to add additional heuristics for this particularly tricky type.

It might be a good idea to also approach this problem using the DictionaryCompoundWordTokenFilter and collectively build some lexical assets for compound splitting for the relevant languages, rather than hacking our models.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

Statistical models and rule-based models always have a failure rate. When you use them you have to decide what to do about the failures. Attacking the failures with another model drives toward Zeno's Paradox. For Chinese-language search, breaking the failures into bigrams makes a lot of sense. The CJK bigram generator creates a massive number of bogus bigrams, and bogus bigrams cause bogus results from sloppy phrase searches.

Smart Chinese and Kuromoji are not systems for doing natural-language processing; they are systems for minimizing bogus bigrams. This allows sloppy phrase queries to find fewer bogus results. In my use case, Smart Chinese created only 2% (40k/1.8m) of the possible bigrams. https://issues.apache.org/jira/browse/SOLR-3653 is the result of my experience in supporting search of Chinese legal documents. I have some useful numbers at the end of the page.