WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

AttributeError: 'sudachipy.latticenode.LatticeNode' object has no attribute 'begin' #133

Closed hiroshi-matsuda-rit closed 4 years ago

hiroshi-matsuda-rit commented 4 years ago

This error might be related to the Cythonization. @polm @sorami Do you have any test cases for this API?

  File "/mnt/c/git/spaCy/venv.wsl/lib/python3.8/site-packages/sudachipy/morpheme.py", line 56, in split
    return self.list.split(mode, self.index, wi)
  File "/mnt/c/git/spaCy/venv.wsl/lib/python3.8/site-packages/sudachipy/morphemelist.py", line 75, in split
    n.begin = offset
AttributeError: 'sudachipy.latticenode.LatticeNode' object has no attribute 'begin'
sorami commented 4 years ago

I had similar cases while investigating #128.

Sorry, no, there were no test cases for this method.

I think the error occurs because, after Cythonization, attributes are no longer directly accessible from Python, so the code should call n.set_begin() instead (this method already exists).

There may be more such cases, which the current test cases didn't catch.
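
To illustrate the failure mode described above, here is a minimal pure-Python analogy. It is only a sketch: `LatticeNodeSketch` and `try_direct_assignment` are hypothetical names, not SudachiPy code, and `__slots__` is used merely to imitate how an extension type rejects attributes it does not expose.

```python
class LatticeNodeSketch:
    """Pure-Python stand-in for a Cython extension type (hypothetical, not SudachiPy's class)."""

    # __slots__ removes the instance __dict__, so undeclared attributes such as
    # `begin` cannot be assigned. This imitates how a compiled extension type
    # behaves when its fields are not exposed to Python.
    __slots__ = ("_begin", "_end")

    def __init__(self):
        self._begin = 0
        self._end = 0

    def set_begin(self, begin):
        # The accessor is the only supported way to change the position.
        self._begin = begin

    def get_begin(self):
        return self._begin


def try_direct_assignment(node, value):
    """Return True if `node.begin = value` succeeds, False if it raises AttributeError."""
    try:
        node.begin = value
    except AttributeError:
        return False
    return True
```

With this stand-in, direct assignment fails with an AttributeError while the setter works, which matches the shape of the traceback in this issue.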

sorami commented 4 years ago
from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

mode = tokenizer.Tokenizer.SplitMode.C
morpheme = tokenizer_obj.tokenize("国家公務員", mode)[0]
morpheme.surface() # '国家公務員'

morpheme.split(tokenizer.Tokenizer.SplitMode.A)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-af36be3916ed> in <module>
----> 1 morpheme.split(tokenizer.Tokenizer.SplitMode.A)

SudachiPy/sudachipy/morpheme.py in split(self, mode)
     54     def split(self, mode):
     55         wi = self.get_word_info()
---> 56         return self.list.split(mode, self.index, wi)
     57
     58     def is_oov(self):

SudachiPy/sudachipy/morphemelist.py in split(self, mode, index, wi)
     73         for wid in word_ids:
     74             n = latticenode.LatticeNode(self.lexicon, 0, 0, 0, wid)
---> 75             n.begin = offset
     76             offset += n.get_word_info().head_word_length
     77             n.end = offset

AttributeError: 'sudachipy.latticenode.LatticeNode' object has no attribute 'begin'
sorami commented 4 years ago

I have fixed the case, and added a test for this method in #134.

I am now looking at other parts of the code that the Cythonization may affect (i.e., those related to Lattice and LatticeNode), which we missed due to a lack of tests.

sorami commented 4 years ago

Memo about splitting in A or B mode:

When using Tokenizer to split text, the splitting from C mode to A/B mode is done by the method Tokenizer._split_path().

However, there are separate methods, Morpheme.split() and MorphemeList.split(), which are independent of the above Tokenizer method.

There were no test cases for the latter, so this issue was not discovered until now.
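
The kind of regression test that was missing could look roughly like the sketch below. It does not use SudachiPy itself: `StubNode`, `assign_offsets`, and the `start` parameter are hypothetical stand-ins, and `set_end` is assumed to exist as the counterpart of `set_begin`. The loop mirrors the one shown in the traceback, but goes through setters instead of attribute assignment.

```python
class StubNode:
    """Hypothetical stand-in for LatticeNode: positions are only reachable via setters."""

    __slots__ = ("_begin", "_end")

    def __init__(self):
        self._begin = 0
        self._end = 0

    def set_begin(self, begin):
        self._begin = begin

    def set_end(self, end):
        # Assumed counterpart of set_begin; the thread only confirms set_begin.
        self._end = end

    def span(self):
        return (self._begin, self._end)


def assign_offsets(lengths, start=0):
    """Assign contiguous begin/end offsets to one node per sub-unit length."""
    offset = start
    nodes = []
    for length in lengths:  # `length` plays the role of head_word_length
        n = StubNode()
        n.set_begin(offset)
        offset += length
        n.set_end(offset)
        nodes.append(n)
    return nodes


def test_split_offsets_are_contiguous():
    # Lengths 2, 2, 1 stand for a five-character word split into three units.
    spans = [n.span() for n in assign_offsets([2, 2, 1])]
    assert spans == [(0, 2), (2, 4), (4, 5)]
```

A test of this shape exercises the split path directly, so an AttributeError from a missing attribute would surface immediately rather than only through downstream users.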

polm commented 4 years ago

Sorry I missed this issue too... I thought I had checked the Cythonized attributes during development, but obviously I missed some. I'll take a look and see what else I missed.

sorami commented 4 years ago

I have released another version, v0.4.9, to fix this issue.

hiroshi-matsuda-rit commented 4 years ago

Thank you so much! @sorami and @polm