Add katakana stem filter to better deal with certain katakana spelling variants [LUCENE-3901] #4974

Closed: asfimport closed this issue 12 years ago

asfimport commented 12 years ago

Many Japanese katakana words end in a long sound that is sometimes optional.

For example, パーティー and パーティ are both perfectly valid for "party". Similarly we have センター and センタ that are variants of "center" as well as サーバー and サーバ for "server".

I'm proposing that we add a katakana stemmer that removes this long sound if the terms are longer than a configurable length. It's also possible to add the variant as a synonym, but I think stemming is preferred from a ranking point of view.
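To make the proposal concrete, here is a minimal standalone sketch of the stemming rule in plain Java. It is not the code from the attached patch. The class name `KatakanaStemSketch`, the method `stem`, and the reading of "longer than a configurable length" as "at least `minLength` characters" are all assumptions made for illustration.

```java
// Hypothetical illustration only; not the filter from the patch.
final class KatakanaStemSketch {
    // U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK (the trailing long sound)
    private static final char PROLONGED_SOUND_MARK = '\u30FC';

    private KatakanaStemSketch() {}

    /**
     * Drops one trailing long-sound mark when the term is made up entirely of
     * full-width katakana and has at least minLength characters (assumed
     * interpretation of the configurable length). Otherwise returns the term as-is.
     */
    static String stem(String term, int minLength) {
        if (term.isEmpty() || term.length() < minLength) {
            return term;
        }
        if (!isFullWidthKatakana(term)) {
            return term;
        }
        int last = term.length() - 1;
        if (term.charAt(last) == PROLONGED_SOUND_MARK) {
            return term.substring(0, last);
        }
        return term;
    }

    // True when every char is in the Unicode Katakana block (U+30A0..U+30FF).
    private static boolean isFullWidthKatakana(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 4 is used here only as an example threshold.
        System.out.println(stem("パーティー", 4));
        System.out.println(stem("センター", 4));
        System.out.println(stem("コピー", 4));
    }
}
```

With a threshold of 4, パーティー and センター lose their trailing ー, while a short word such as コピー (3 characters) is left alone, which is the reason for having a minimum length at all.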


Migrated from LUCENE-3901 by Christian Moen (@cmoen), resolved Mar 24 2012 Attachments: LUCENE-3901.patch (versions: 3)

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Patch for this coming up shortly.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Find attached a patch for this.

The stemming is done by KuromojiKatakanaStemFilter, which has been added to KuromojiAnalyzer. A corresponding KuromojiKatakanaStemFilterFactory has been added to the text_ja field type in schema.xml.

Note that this stemming is now turned on by default and I think it makes good sense to do so. The minimum length of a token considered for stemming is configurable and I've made the default of 4 explicit in schema.xml to convey that it's there.

The stemmer only supports full-width katakana and should be used in combination with a CJKWidthFilter if stemming half-width characters is required and you're doing your own wiring. Both text_ja and KuromojiAnalyzer take care of this, and the default overall processing is the same.
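The ordering requirement (fold width first, then stem) can be sketched in plain Java. This is a rough illustration, not the patch or CJKWidthFilter itself: it uses `java.text.Normalizer` with NFKC as a stand-in for the width folding, which is an assumption (NFKC changes more than width), and the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration only; NFKC stands in for CJKWidthFilter here.
final class WidthThenStemSketch {
    private static final char PROLONGED_SOUND_MARK = '\u30FC';

    private WidthThenStemSketch() {}

    // Stand-in for width folding: NFKC maps half-width katakana to full-width.
    static String foldWidth(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFKC);
    }

    // Same rule as the stemmer describes: full-width katakana only,
    // at least minLength chars (assumed reading), drop one trailing U+30FC.
    static String stem(String term, int minLength) {
        if (term.isEmpty() || term.length() < minLength) {
            return term;
        }
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return term; // half-width katakana is in a different Unicode block
            }
        }
        int last = term.length() - 1;
        return term.charAt(last) == PROLONGED_SOUND_MARK ? term.substring(0, last) : term;
    }

    static String foldThenStem(String term, int minLength) {
        return stem(foldWidth(term), minLength);
    }

    public static void main(String[] args) {
        // Half-width spelling of "server", written with escapes to avoid font issues.
        String halfWidthServer = "\uFF7B\uFF70\uFF8A\uFF9E\uFF70";
        System.out.println(stem(halfWidthServer, 4));         // left untouched
        System.out.println(foldThenStem(halfWidthServer, 4)); // folded, then stemmed
    }
}
```

Stemming the half-width input directly is a no-op because none of its characters fall in the Katakana block, so the width folding has to run earlier in the chain.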

There are some test cases in TestKuromojiKatakanaStemFilter, and I've also added a case to TestKuromojiAnalyzer that demonstrates how the stemming works in combination with katakana compound splitting.

In Japanese, "manager" can be written as both マネージャー and マネージャ (and probably also マネジャー). For the compound シニアプロジェクトマネージャー (senior project manager), we now get the tokens シニア (senior), プロジェクト (project) and マネージャ (manager), where the last token has been stemmed by removing the trailing ー. Kuromoji also makes the compound シニアプロジェクトマネージャ a synonym of シニア, and the ー is removed from the synonym compound as well.

Tests pass and I've also tested this end-to-end in a Solr trunk build.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch looks great!

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks a lot, Robert.

I'll do some more testing and hopefully I can commit this to trunk and branch_3x tomorrow.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Updated patch with minor whitespace changes to schema.xml and added an entry in CHANGES.txt.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Committed revision 1304719 on trunk. Backporting to branch_3x.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Committed revision 1304727 on branch_3x. Fixed a small javadoc issue in revisions 1304728 and 1304741.