Segmentation wrong with token contains square brackets?

atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search

Apache License 2.0


Segmentation wrong with token contains square brackets? #108

Open reckart opened 8 years ago

reckart commented 8 years ago

Looks like the segmenter does not work properly if there are square brackets, e.g.:

```
[	名詞,サ変接続,*,*,*,*,*,*,*
滧	名詞,一般,*,*,*,*,*,*,*
の	助詞,連体化,*,*,*,*,の,ノ,ノ
]。	名詞,サ変接続,*,*,*,*,*,*,*

「	記号,括弧開,*,*,*,*,「,「,「
国宝	名詞,一般,*,*,*,*,国宝,コクホウ,コクホー
五	名詞,数,*,*,*,*,五,ゴ,ゴ
城	名詞,一般,*,*,*,*,城,シロ,シロ
」[	名詞,サ変接続,*,*,*,*,*,*,*
```

cmoen commented 8 years ago

I agree that it might be more useful to split ]。 into ] and 。. However, this is how the dictionary assets we are using have been designed. It might make sense to change some of this, and I have some ideas I'd like to try out...
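Until the dictionary behavior changes, one possible workaround is to split such symbol runs after tokenization. The following is only a minimal sketch, not part of Kuromoji: the class `SymbolRunSplitter`, its method names, and the character set in `SPLIT_CHARS` are all hypothetical. It works on plain surface strings (for example, the values returned by `Token.getSurface()`), so part-of-speech features are not carried over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing helper; it is NOT part of Kuromoji.
final class SymbolRunSplitter {

    // Assumption: the characters that should always become their own token.
    private static final String SPLIT_CHARS = "[]「」。、";

    // True if the surface has two or more characters and all of them are in SPLIT_CHARS.
    static boolean isSymbolRun(String surface) {
        if (surface.length() < 2) {
            return false;
        }
        for (int i = 0; i < surface.length(); i++) {
            if (SPLIT_CHARS.indexOf(surface.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    // Returns a new list in which every symbol run is replaced by its single characters.
    static List<String> split(List<String> surfaces) {
        List<String> out = new ArrayList<>();
        for (String surface : surfaces) {
            if (isSymbolRun(surface)) {
                for (int i = 0; i < surface.length(); i++) {
                    out.add(String.valueOf(surface.charAt(i)));
                }
            } else {
                out.add(surface);
            }
        }
        return out;
    }
}
```

With the surfaces from the report above, `]。` would come out as `]` and `。`, and `」[` as `」` and `[`. All other tokens pass through unchanged.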

mharn commented 6 years ago

Just jumping in to say that this output highlights a problem with Mecab-ipadic: symbols such as the [ and ] here are treated as 名詞・サ変接続.

See the fix for this problem here: https://github.com/taku910/mecab/pull/37

cmoen commented 6 years ago

Thanks. We could also do this using a user-defined unknown-word (unk) definition...