Christian Moen (@cmoen) (migrated from JIRA)
Find attached a patch for trunk that improves the heuristic. Search segmentation tests/examples are in search-segmentation-tests.txt and are validated by TestSearchMode.
Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.
I've also moved the ASF license header in TestExtendedMode.java to the right place.
Christian Moen (@cmoen) (migrated from JIRA)
If you want to try the new search mode, there's a simple Kuromoji web interface at http://atilika.org/kuromoji that may be useful. After entering some text and pressing Enter, click "normal mode" to switch to "search mode" to test the various segmentation modes for the given input.
Robert Muir (@rmuir) (migrated from JIRA)
Patch looks good to me... so the basics are that we apply a different penalty based on whether the text is kanji or not, rather than just a single penalty of 10000 (plus some parameter tuning)?
Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.
I think this is OK for now. Long term (if there end up being different values for other dictionaries), we can make these conditional on the dictionary type: either at build time (recording these values in the dictionary) or, better, by recording the dictionary type itself and choosing the values at run time based on it.
By recording the type, we would also be able to use e.g. assumeTrue(dictionaryType == IPADIC) in unit tests and things like that, and who knows what else, but let's not worry about it here.
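To make the run-time option concrete, here is a minimal sketch of looking up search-mode penalties by a recorded dictionary type. Everything in it is an assumption for illustration: the `DictionaryType` enum, the `Penalties` holder, the `penaltiesFor` helper and the numeric values are hypothetical and are not taken from the patch or from Kuromoji's API.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: choose search-mode penalties at run time from a
// recorded dictionary type. None of these names exist in the patch.
final class DictionaryTuningSketch {

    // Assumed set of dictionary types; only IPADIC is tuned today.
    enum DictionaryType { IPADIC, OTHER }

    // Simple immutable holder for the two penalties discussed in this issue.
    static final class Penalties {
        final int kanjiPenalty;
        final int otherPenalty;

        Penalties(int kanjiPenalty, int otherPenalty) {
            this.kanjiPenalty = kanjiPenalty;
            this.otherPenalty = otherPenalty;
        }
    }

    private static final Map<DictionaryType, Penalties> TUNING =
        new EnumMap<>(DictionaryType.class);

    static {
        // Illustrative values only, not the tuned values from the patch.
        TUNING.put(DictionaryType.IPADIC, new Penalties(3000, 1700));
    }

    // Fails loudly for a dictionary type that has no recorded tuning,
    // instead of silently reusing the IPADIC values.
    static Penalties penaltiesFor(DictionaryType type) {
        Penalties penalties = TUNING.get(type);
        if (penalties == null) {
            throw new IllegalStateException("No search-mode tuning recorded for " + type);
        }
        return penalties;
    }

    public static void main(String[] args) {
        Penalties ipadic = penaltiesFor(DictionaryType.IPADIC);
        System.out.println("kanji=" + ipadic.kanjiPenalty + ", other=" + ipadic.otherPenalty);
    }
}
```

With a type recorded like this, a test that depends on the IPADIC tuning could guard itself with an assumption on the type, along the lines of the assumeTrue call mentioned above.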
Christian Moen (@cmoen) (migrated from JIRA)
Patch looks good to me... so the basics are that we apply a different penalty based on whether the text is kanji or not, rather than just a single penalty of 10000 (plus some parameter tuning)?
Thanks a lot, Robert. That's correct.
I agree completely regarding other dictionary support.
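For readers following along, the idea confirmed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: the class name, the length thresholds, the penalty values, the choice to scale the penalty by excess length, and the use of `Character.UnicodeScript.HAN` as the kanji test are all hypothetical.

```java
// Hypothetical sketch of a kanji-aware search-mode penalty.
// All constants and names are assumptions, not values from the patch.
final class SearchPenaltySketch {

    static final int KANJI_LENGTH = 2;     // all-kanji tokens longer than this are penalized
    static final int OTHER_LENGTH = 7;     // other tokens longer than this are penalized
    static final int KANJI_PENALTY = 3000; // assumed per-character penalty, kanji
    static final int OTHER_PENALTY = 1700; // assumed per-character penalty, non-kanji

    // True if every code point in the surface form is a Han ideograph.
    static boolean isAllKanji(String surface) {
        return !surface.isEmpty()
            && surface.codePoints().allMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Extra cost added to a candidate token so that the lattice search
    // prefers splitting long candidates into shorter ones.
    static int computePenalty(String surface) {
        int length = surface.codePointCount(0, surface.length());
        if (isAllKanji(surface)) {
            return length > KANJI_LENGTH ? (length - KANJI_LENGTH) * KANJI_PENALTY : 0;
        }
        return length > OTHER_LENGTH ? (length - OTHER_LENGTH) * OTHER_PENALTY : 0;
    }

    public static void main(String[] args) {
        // Six kanji ("Kansai International Airport"), written as escapes
        // so the file compiles regardless of source encoding.
        String compound = "\u95a2\u897f\u56fd\u969b\u7a7a\u6e2f";
        System.out.println(computePenalty(compound));
        System.out.println(computePenalty("kuromoji"));
    }
}
```

The point of the two thresholds is that kanji compounds become worth splitting at a much shorter length than katakana or Latin-script tokens, which a single flat penalty cannot express.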
Robert Muir (@rmuir) (migrated from JIRA)
Thanks Christian!
Kuromoji has a segmentation mode for search that uses a heuristic to promote additional segmentation of long candidate tokens to get a decompounding effect. This heuristic has been improved. Patch is coming up.
Migrated from LUCENE-3730 by Christian Moen (@cmoen), resolved Feb 01 2012
Attachments: LUCENE-3730_trunk.patch