apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Improved Kuromoji search mode segmentation/decompounding [LUCENE-3730] #4804

asfimport closed 12 years ago

asfimport commented 12 years ago

Kuromoji has a segmentation mode for search that uses a heuristic to promote additional segmentation of long candidate tokens to get a decompounding effect. This heuristic has been improved. Patch is coming up.


Migrated from LUCENE-3730 by Christian Moen (@cmoen), resolved Feb 01 2012
Attachments: LUCENE-3730_trunk.patch

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Find attached a patch for trunk that improves the heuristic. Search segmentation tests/examples are in search-segmentation-tests.txt and are validated by TestSearchMode.

Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.

I've also moved the ASF license header in TestExtendedMode.java to the right place.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

If you want to try the new search mode, there's a simple Kuromoji web interface at http://atilika.org/kuromoji that may be useful. After entering some text and pressing Enter, click "normal mode" to switch to "search mode" and compare the segmentation modes for the given input.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch looks good to me... so the gist is that we apply a different penalty depending on whether the text is kanji or not, rather than just a single penalty of 10000 (plus some parameter tuning)?

> Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.

I think this is OK for now. Long term (if other dictionaries end up needing different values), we can make these conditional on dictionary type: either at build time (recording the values in the dictionary) or, better, by recording the dictionary type itself and selecting the values at run time.

By recording the type, we would also be able to use e.g. assumeTrue(dictionaryType == IPADIC) in unit tests and things like that, and who knows what else, but let's not worry about it here.
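
A minimal sketch of the run-time option described above, assuming the dictionary type is recorded somewhere the tokenizer can read it. Every name here (`SearchModeTuning`, `DictionaryType`, `Params`, `forDictionary`) is hypothetical and not part of Kuromoji or the patch. The IPADIC values are illustrative placeholders; the fallback simply reuses the single flat penalty of 10000 mentioned in this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: picks search-mode penalties by dictionary type at run time.
final class SearchModeTuning {

    // Assumed set of dictionary types; only IPADIC is discussed in this issue.
    enum DictionaryType { IPADIC, OTHER }

    // Penalty pair applied to long kanji and long non-kanji candidates.
    static final class Params {
        final int kanjiPenalty;
        final int otherPenalty;

        Params(int kanjiPenalty, int otherPenalty) {
            this.kanjiPenalty = kanjiPenalty;
            this.otherPenalty = otherPenalty;
        }
    }

    private static final Map<DictionaryType, Params> PARAMS =
            new EnumMap<>(DictionaryType.class);

    static {
        // Placeholder values for illustration, not the tuned values from the patch.
        PARAMS.put(DictionaryType.IPADIC, new Params(3000, 1700));
        // Untuned dictionaries fall back to one flat penalty for both cases.
        PARAMS.put(DictionaryType.OTHER, new Params(10000, 10000));
    }

    // Looks up the parameters for the recorded dictionary type.
    static Params forDictionary(DictionaryType type) {
        Params params = PARAMS.get(type);
        if (params == null) {
            throw new IllegalArgumentException("No tuning for dictionary type: " + type);
        }
        return params;
    }
}
```

With the type recorded like this, a JUnit test could guard IPADIC-specific expectations with `Assume.assumeTrue(type == SearchModeTuning.DictionaryType.IPADIC)`, so the test is skipped rather than failed under another dictionary.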

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

> Patch looks good to me... so the gist is that we apply a different penalty depending on whether the text is kanji or not, rather than just a single penalty of 10000 (plus some parameter tuning)?

Thanks a lot, Robert. That's correct.

I agree completely regarding other dictionary support.
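
For readers following along, the kanji/non-kanji distinction confirmed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the class name, the length thresholds, the penalty values, and the choice to scale the penalty by the excess length are all assumptions made for the example.

```java
// Hypothetical sketch of a search-mode penalty that treats kanji runs differently.
final class SearchModePenalty {

    // Assumed thresholds and penalties; illustrative only, not taken from the patch.
    static final int KANJI_LENGTH_THRESHOLD = 2;
    static final int OTHER_LENGTH_THRESHOLD = 7;
    static final int KANJI_PENALTY = 3000;
    static final int OTHER_PENALTY = 1700;

    // True if the candidate is non-empty and every code point is in the Han script.
    static boolean isAllKanji(String candidate) {
        if (candidate.isEmpty()) {
            return false;
        }
        return candidate.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Extra cost added to a long candidate token so that shorter
    // segmentations become more attractive to the lattice search.
    static int penalty(String candidate) {
        int length = candidate.codePointCount(0, candidate.length());
        if (isAllKanji(candidate)) {
            // Kanji compounds get penalized from a short length onward.
            return length > KANJI_LENGTH_THRESHOLD
                    ? (length - KANJI_LENGTH_THRESHOLD) * KANJI_PENALTY
                    : 0;
        }
        // Other text (kana, Latin, mixed) is only penalized when much longer.
        return length > OTHER_LENGTH_THRESHOLD
                ? (length - OTHER_LENGTH_THRESHOLD) * OTHER_PENALTY
                : 0;
    }
}
```

Under these assumed values, a two-character kanji word such as 東京 gets no penalty, while a six-character kanji compound such as 関西国際空港 is penalized for its four excess characters, which nudges the segmenter toward splitting it into shorter dictionary words.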

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Christian!