WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

OOV flag is not properly set #105

Closed: sorami closed this issue 5 years ago

sorami commented 5 years ago

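Every token is flagged as OOV, even 徳島, which is found in the dictionary (dictionary ID 0). The CLI prints (OOV) for both tokens, and in the Python API is_oov() returns a bound method instead of a bool: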

$ echo 徳島Sudachi | sudachipy -a
徳島    名詞,固有名詞,地名,一般,*,*     徳島    徳島    トクシマ        0       (OOV)
Sudachi 名詞,普通名詞,一般,*,*,*        sudachi sudachi         -1      (OOV)
EOS
>>> from sudachipy import tokenizer, dictionary
>>> tokenizer_obj = dictionary.Dictionary().create()
>>> [(m.surface(), m.dictionary_id(), m.is_oov()) for m in tokenizer_obj.tokenize("徳島Sudachi")]
[('徳島', 0, <bound method LatticeNode.is_oov of <sudachipy.latticenode.LatticeNode object at 0x7f18280f4c50>>), ('Sudachi', -1, <bound method LatticeNode.is_oov of <sudachipy.latticenode.LatticeNode object at 0x7f182807e590>>)]
>>>
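
The REPL output above shows is_oov() handing back a bound method rather than True or False. A bound method is always truthy in Python, which would explain why every token prints as (OOV). The sketch below reproduces that symptom with hypothetical stand-in classes (Node, BuggyMorpheme, FixedMorpheme); they are not SudachiPy's actual implementation, and the missing call parentheses are only an assumed cause consistent with the output.

```python
class Node:
    """Hypothetical stand-in for a lattice node; not SudachiPy's real class."""

    def __init__(self, oov):
        self._oov = oov

    def is_oov(self):
        return self._oov


class BuggyMorpheme:
    """Assumed bug: forwards the method object instead of calling it."""

    def __init__(self, node):
        self._node = node

    def is_oov(self):
        return self._node.is_oov  # missing (), so a bound method is returned


class FixedMorpheme:
    """Same wrapper with the call made, so a bool is returned."""

    def __init__(self, node):
        self._node = node

    def is_oov(self):
        return self._node.is_oov()


# A node for a word that IS in the dictionary, so OOV should be False.
in_dictionary = Node(oov=False)

buggy_result = BuggyMorpheme(in_dictionary).is_oov()
fixed_result = FixedMorpheme(in_dictionary).is_oov()

# The buggy result is a callable, and any bound method is truthy.
print(callable(buggy_result), bool(buggy_result))
# The fixed result is the actual flag.
print(fixed_result)
```

Any caller that writes `if m.is_oov():` would take the OOV branch for every token with the buggy wrapper, regardless of the node's real flag.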