Hmmm. This is a good bug. It is most probably a bug in the string representation of a token, caused by a bug in splitting syllables with affixes into two distinct tokens. It looks like the attribute token.syls does not have the expected content.
What line 62 does is find the actual characters in token.content using the indices listed in token.syls. It looks like token.syls has not been correctly split in pybo/splitaffixed.py, in a private function called __split_syls().
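For clarity, here is a minimal sketch (not pybo's actual implementation; the internals shown are assumptions) of how that kind of index-based lookup only fails at display time when token.syls holds stale indices:

```python
# Minimal sketch, assuming token.syls holds one list of character
# indices (into token.content) per syllable, as described above.
class Token:
    def __init__(self, content, syls):
        self.content = content  # raw text of the token
        self.syls = syls        # e.g. [[0, 1, 2], [4, 5, 6]] for two syllables

    @property
    def cleaned_content(self):
        # Rebuild each syllable from its character indices; an index
        # left stale by a bad affix split raises IndexError right here.
        return ''.join(self.content[i] for syl in self.syls for i in syl)

    def __repr__(self):
        return self.cleaned_content

t = Token('བཀྲ་ཤིས', [[0, 1, 2], [4, 5, 6]])
print(t)  # fine; a token whose indices point past len(content) would fail instead
```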
Could you try with the latest version I have pushed?
By the way, the tokenizer did not fail; it is the string representation of the content of the Token object that fails. There is still a bug somewhere, but not big enough to prevent the tokenizer from functioning altogether. Otherwise, it would not have gotten past the line `tokens = tok.tokenize(input_str)`.
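To make the distinction concrete (reusing the line quoted above), the failure would look like this:

```python
tokens = tok.tokenize(input_str)  # this step succeeds: the list of tokens is built
print(tokens)                     # only here does the failing string representation run
```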
I tried installing from the latest master, but the issue is still there.
I don't seem to be able to reproduce the bug, using the configuration that is in the latest master.
It seems you can't print one of the produced tokens.
Maybe a way to identify it would be to do something like the following:
```python
for num, token in enumerate(tokens):
    print(num)    # to identify which token fails to print
    print(token)  # this calls __repr__(), responsible for the failing `cleaned_content`
```
Something else: if this code executes without a problem, IPython/Jupyter may have trouble rendering the @property attributes of classes.
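For illustration, here is a toy class (hypothetical names, not from pybo) showing why an exception inside a @property surfaces only when the object is displayed:

```python
class Demo:  # hypothetical class, just to illustrate the failure mode
    @property
    def cleaned_content(self):
        # any error raised here stays hidden until the attribute is accessed
        raise IndexError('stale syllable index')

    def __repr__(self):
        return self.cleaned_content

d = Demo()  # construction succeeds
print(d)    # the IndexError surfaces only now, when __repr__() runs
```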
Could you elaborate on what you do instead of loading the output of tok.tokenize() into a variable, and how the error only happens then?
Very strange: without actually reinstalling pybo, things work now. In between, #9 came up as a new issue, which was easy to resolve as I mention there. I'm using an env that went through some other changes in the meantime, so it probably had to do with that. I will try to reproduce later with a clean env and report based on that.
Thank you!
I've installed from PyPI and I'm doing...
...at which point I get:
If I don't load tok.tokenize(input_str) into a variable, then the error comes at that step.