atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Debug graph for multi-tokenization #114

Open emmanuellegedin opened 7 years ago

emmanuellegedin commented 7 years ago

Overview

Adds the function debugMultiTokenize(), which works like the existing debugTokenize() but supports multi-tokenization. The function generates a graph in DOT format.

Details

Each tokenization corresponds to a path in the graph. We assign a color to each path and color its edges accordingly. An edge that belongs to more than one path is drawn in the colors of all the paths that include it.
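
For reference, one way to draw a multi-colored edge in DOT is a color list: Graphviz draws colon-separated colors in an edge's color attribute as parallel lines. The sketch below only illustrates that idea. The class name, method names, node names and hex colors are hypothetical and are not taken from this pull request.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Kuromoji: formats one DOT edge that is
// shared by several tokenization paths.
class MultiPathEdge {

    // Joins the colors of all paths using the edge into a Graphviz color list,
    // e.g. "#40e050:#4050e0". Graphviz renders such a list as parallel lines.
    static String colorList(List<String> pathColors) {
        return String.join(":", pathColors);
    }

    // Builds a single DOT edge statement with the combined color attribute.
    static String edge(String from, String to, List<String> pathColors) {
        return String.format("  %s -> %s [color=\"%s\"];", from, to, colorList(pathColors));
    }

    public static void main(String[] args) {
        System.out.println("digraph g {");
        // Edge used by two paths: two colors.
        System.out.println(edge("n0", "n1", Arrays.asList("#40e050", "#4050e0")));
        // Edge used by one path only: a single color.
        System.out.println(edge("n1", "n2", Arrays.asList("#40e050")));
        System.out.println("}");
    }
}
```

Piping the printed text through the dot tool should show the first edge as two parallel colored lines and the second as a single line.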

This feature also adds a legend to the graph to show which path corresponds to which color.

Screenshots

[Three screenshots of generated graphs, taken 2016-12-15 23:04:16, 2016-12-18 17:35:45 and 2016-12-18 17:36:36]

Possible Issues

There are a few issues that I would be happy to get opinions on.

Colors

The colors are generated by selecting equidistant hue angles in the HSB color model, starting from the green that debugTokenize() already uses.
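
As a rough illustration of the equidistant-hue idea, the sketch below spaces N hues evenly around the color wheel and emits them in the "H S V" string form that Graphviz accepts for colors (each component between 0 and 1). The starting hue of 1/3 (pure green), the fixed saturation and brightness, and the class and method names are assumptions made for this example. The exact green used by debugTokenize() may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Kuromoji: generates evenly spaced path colors.
class PathColors {

    // Assumption: start at pure green (hue 1/3 of the wheel).
    static final double START_HUE = 1.0 / 3.0;

    // Returns `count` colors whose hues are 1/count apart, wrapping around
    // the color wheel, formatted as Graphviz "H S V" strings.
    static List<String> equidistantColors(int count) {
        List<String> colors = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            double hue = (START_HUE + (double) i / count) % 1.0;
            // Locale.ROOT keeps '.' as the decimal separator on every system.
            colors.add(String.format(Locale.ROOT, "%.3f 1.000 0.900", hue));
        }
        return colors;
    }

    public static void main(String[] args) {
        for (String color : equidistantColors(4)) {
            System.out.println(color);
        }
    }
}
```

With this scheme neighboring colors get closer together as the number of paths grows, so graphs with many tokenizations may be harder to read.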

Pros

Cons

Legend

As far as I know, DOT has no simple way to create a legend. The current one is built as a custom subgraph cluster. Because DOT decides the positions and lengths of the edges, the legend ends up wider than necessary. There may be a better way to create it.

The legend is placed in the bottom left, which might not be ideal.
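
One alternative that might be worth considering is a single node with an HTML-like table label inside the cluster, so that no legend edges need to be laid out. The sketch below is only an assumption about how that could look. It is not the implementation in this pull request, and the class name, method name, path labels and colors are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Kuromoji: renders a legend as one
// plaintext node whose label is a Graphviz HTML-like table.
class LegendSketch {

    // Maps a path label to its color and emits a DOT cluster with one table
    // row per path: a colored swatch cell followed by the label.
    // Assumption: labels contain no characters that need HTML escaping.
    static String legend(Map<String, String> colorByPath) {
        StringBuilder sb = new StringBuilder();
        sb.append("  subgraph cluster_legend {\n");
        sb.append("    label=\"Legend\";\n");
        sb.append("    legend [shape=plaintext, label=<\n");
        sb.append("      <TABLE BORDER=\"0\" CELLSPACING=\"2\">\n");
        for (Map.Entry<String, String> entry : colorByPath.entrySet()) {
            sb.append("        <TR><TD BGCOLOR=\"").append(entry.getValue())
              .append("\" WIDTH=\"20\"> </TD><TD ALIGN=\"LEFT\">")
              .append(entry.getKey()).append("</TD></TR>\n");
        }
        sb.append("      </TABLE>\n");
        sb.append("    >];\n");
        sb.append("  }\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> colors = new LinkedHashMap<>();
        colors.put("Path 1", "#40e050");
        colors.put("Path 2", "#4050e0");
        System.out.println("digraph g {");
        System.out.print(legend(colors));
        System.out.println("}");
    }
}
```

Because the legend is a single node, its width is determined by the table contents rather than by edge layout. Where DOT places it relative to the rest of the graph would still need to be checked.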