atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Feature n best search #106

Closed emmanuellegedin closed 8 years ago

emmanuellegedin commented 8 years ago

Overview

Adds the ability to find multiple possible tokenizations instead of only the best one. Tokenizations are returned in ascending order of cost. The caller can request up to n lowest-cost tokenizations, all tokenizations within a specified cost slack of the optimal cost, or a combination of the two.
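To make "ascending order of cost" concrete, here is a minimal, self-contained sketch that enumerates every segmentation of a short string and sorts the candidates by total cost. It is not the implementation in this PR: the class `ToyNBest`, the romanized input, and the word costs are all hypothetical, and a real analyzer also adds connection costs between adjacent tokens and searches a lattice rather than enumerating by brute force.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyNBest {
    /** One candidate segmentation and its total cost. */
    static final class Candidate {
        final List<String> tokens;
        final int cost;

        Candidate(List<String> tokens, int cost) {
            this.tokens = tokens;
            this.cost = cost;
        }

        @Override
        public String toString() {
            return tokens + " cost=" + cost;
        }
    }

    // Hypothetical word costs (lower is better); not real dictionary values.
    static final Map<String, Integer> WORD_COSTS = new HashMap<>();
    static {
        WORD_COSTS.put("tokyoto", 3);
        WORD_COSTS.put("tokyo", 2);
        WORD_COSTS.put("kyoto", 3);
        WORD_COSTS.put("kyo", 4);
        WORD_COSTS.put("to", 2);
    }

    /** Returns every segmentation of text into known words, sorted by ascending cost. */
    static List<Candidate> allSegmentations(String text) {
        List<Candidate> out = new ArrayList<>();
        collect(text, 0, new ArrayList<>(), 0, out);
        out.sort(Comparator.comparingInt(c -> c.cost));
        return out;
    }

    // Depth-first enumeration: try every known word that starts at position 'start'.
    private static void collect(String text, int start, List<String> prefix,
                                int cost, List<Candidate> out) {
        if (start == text.length()) {
            out.add(new Candidate(new ArrayList<>(prefix), cost));
            return;
        }
        for (int end = start + 1; end <= text.length(); end++) {
            Integer wordCost = WORD_COSTS.get(text.substring(start, end));
            if (wordCost == null) {
                continue; // not a known word, so this split point is not allowed
            }
            prefix.add(text.substring(start, end));
            collect(text, end, prefix, cost + wordCost, out);
            prefix.remove(prefix.size() - 1);
        }
    }

    public static void main(String[] args) {
        for (Candidate c : allSegmentations("tokyoto")) {
            System.out.println(c);
        }
    }
}
```

With these made-up costs the input has four segmentations, with total costs 3, 4, 5, and 8. Taking the first n entries of the sorted list corresponds to an n-best request, and cutting the list at the optimal cost plus a slack corresponds to a slack request.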

Features

The added features are available through the following methods of the TokenizerBase class.

multiTokenize(String text, int maxCount, int costSlack)

Get up to maxCount tokenizations with cost at most OPT + costSlack, where OPT is the optimal cost. The tokenizations are ordered by cost in ascending order.

multiTokenizeNBest(String text, int n)

Get the n tokenizations with the lowest costs. If there are fewer than n unique tokenizations, all possible tokenizations are returned. The tokenizations are ordered by cost in ascending order.

multiTokenizeBySlack(String text, int costSlack)

Get all tokenizations with cost at most OPT + costSlack, where OPT is the optimal cost. The tokenizations are ordered by cost in ascending order.
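The three methods differ only in which limit applies: a count, a cost slack, or both. The sketch below illustrates that selection rule on a plain list of candidate costs. The class `SlackSelection` and its methods are hypothetical and operate on integers rather than tokenizations. Treating the n-best and slack variants as special cases of the combined rule is an assumption made for illustration; the PR description does not say how they are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SlackSelection {
    /**
     * Keeps up to maxCount costs that are at most OPT + costSlack,
     * where OPT is the first (lowest) cost. Input must be sorted ascending.
     */
    static List<Integer> select(List<Integer> sortedCosts, int maxCount, int costSlack) {
        List<Integer> kept = new ArrayList<>();
        if (sortedCosts.isEmpty()) {
            return kept;
        }
        // Use long so that OPT + costSlack cannot overflow int.
        long limit = (long) sortedCosts.get(0) + costSlack;
        for (int cost : sortedCosts) {
            if (kept.size() >= maxCount || cost > limit) {
                break; // sorted input: nothing later can qualify
            }
            kept.add(cost);
        }
        return kept;
    }

    /** n-best: count limit only (assumed equivalent to an unbounded slack). */
    static List<Integer> nBest(List<Integer> sortedCosts, int n) {
        return select(sortedCosts, n, Integer.MAX_VALUE);
    }

    /** Slack only (assumed equivalent to an unbounded count). */
    static List<Integer> bySlack(List<Integer> sortedCosts, int costSlack) {
        return select(sortedCosts, Integer.MAX_VALUE, costSlack);
    }

    public static void main(String[] args) {
        List<Integer> costs = Arrays.asList(3, 4, 5, 8); // hypothetical candidate costs
        System.out.println(select(costs, 3, 1)); // count 3, slack 1
        System.out.println(nBest(costs, 10));    // fewer than 10 candidates exist
        System.out.println(bySlack(costs, 2));   // everything within OPT + 2
    }
}
```

With the hypothetical costs 3, 4, 5, 8, a count of 3 with slack 1 keeps only 3 and 4, because the slack is the tighter limit. Asking for 10 best returns all four, and a slack of 2 keeps 3, 4, and 5.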

Unsupported Features/Known Issues

Splitting the text into sentences before tokenizing is not yet supported, so tokenizing long texts can use a lot of memory.
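Until that option exists, a caller could split long input into sentences before passing each piece to the tokenizer. The sketch below is a caller-side workaround under stated assumptions, not part of this PR: the class `SentenceChunker` is hypothetical, and it simply breaks after the ideographic full stop, the fullwidth exclamation and question marks, and newlines.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceChunker {
    // Sentence-ending characters: ideographic full stop (U+3002),
    // fullwidth exclamation mark (U+FF01), fullwidth question mark (U+FF1F), newline.
    private static final String TERMINATORS = "\u3002\uFF01\uFF1F\n";

    /** Splits text after each terminator; the terminator stays with its sentence. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (TERMINATORS.indexOf(text.charAt(i)) >= 0) {
                sentences.add(text.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < text.length()) {
            sentences.add(text.substring(start)); // trailing text without a terminator
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("ab\u3002cd\uFF01ef")) {
            System.out.println(s);
        }
    }
}
```

Each piece can then be tokenized on its own, which bounds the size of any single analysis. Note that this yields a separate list of candidate tokenizations per sentence; combining them into a ranked list for the whole text is a separate problem that this sketch does not address.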

cmoen commented 8 years ago

Thanks a lot for this, Emanuel! I'll merge this.