OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

tokenizer fails #8

Closed mikkokotila closed 6 years ago

mikkokotila commented 6 years ago

I've installed from PyPI and I'm doing...

import pybo as bo

# initialize the tokenizer
tok = bo.BoTokenizer('POS')

# load a string to a variable
input_str = 'འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་'

# tokenize the input
tokens = tok.tokenize(input_str)

# show the results
tokens

...at which point I get:

IndexError                                Traceback (most recent call last)
~/dev/astetik_test/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381                 if cls in self.type_pprinters:
    382                     # printer registered in self.type_pprinters
--> 383                     return self.type_pprinters[cls](obj, self, cycle)
    384                 else:
    385                     # deferred printer

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    559                 p.text(',')
    560                 p.breakable()
--> 561             p.pretty(x)
    562         if len(obj) == 1 and type(obj) is tuple:
    563             # Special case for 1-item tuples.

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    398                         if cls is not object \
    399                                 and callable(cls.__dict__.get('__repr__')):
--> 400                             return _repr_pprint(obj, self, cycle)
    401 
    402             return _default_pprint(obj, self, cycle)

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     for idx,output_line in enumerate(output.splitlines()):
    697         if idx:

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in __repr__(self)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in <listcomp>(.0)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in <listcomp>(.0)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

IndexError: string index out of range

If I don't load tok.tokenize(input_str) into a variable, the error occurs at that step instead.

drupchen commented 6 years ago

Hmmm, this is a good bug. It is most probably a bug in the string representation of a token in the list, which in turn comes from a bug in splitting syllables with affixes into two distinct tokens. It looks like the attribute token.syls does not have the expected content.

What line 62 does is find the actual characters in token.content by using the indices listed in token.syls. It looks like token.syls has not been correctly split in pybo's splitaffixed.py, in a private function called __split_syls().
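To make the mechanism concrete, here is a minimal, self-contained sketch of the indexing scheme described above (the `content`/`syls` names mimic pybo's Token attributes, but the data and the bounds-checked helper are illustrative, not pybo code). Each entry in `syls` is a list of character indices into `content`, so a stale index past the end of the string raises exactly this IndexError:

```python
# Sketch of the syls -> content indexing described above (hypothetical data).
content = "ཀ་ཁ་"
syls = [[0], [2]]          # character indices of each syllable in `content`
bad_syls = [[0], [2, 99]]  # index 99 is out of range: reproduces the IndexError

def join_syls(content, syls):
    """Rebuild the syllable strings, skipping any out-of-range index."""
    return ' '.join(
        ''.join(content[i] for i in syl if i < len(content))
        for syl in syls
    )

print(join_syls(content, syls))      # both syllables reconstructed
print(join_syls(content, bad_syls))  # no crash despite the bad index
```

The guard only masks the symptom, of course; the actual fix would be to make sure the splitting step never produces indices outside the token's content.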

Could you try with the latest version I have pushed?

drupchen commented 6 years ago

By the way, the tokenizer did not fail; it is the string representation of the content of the Token object that fails. There is still a bug somewhere, but not big enough to prevent the tokenizer from functioning altogether. Otherwise it would not have gotten past the line "tokens = tok.tokenize(input_str)".
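A minimal sketch of the distinction being made here (hypothetical class, not pybo code): an object can be constructed and stored in a list just fine while its __repr__ still raises, and the exception only surfaces when something tries to display it:

```python
class BrokenRepr:
    """Construction succeeds; only the string representation fails."""
    def __init__(self, content, syls):
        self.content = content
        self.syls = syls

    def __repr__(self):
        # Same pattern as token.py line 62: indexes content via syls.
        return ' '.join(''.join(self.content[i] for i in syl)
                        for syl in self.syls)

tokens = [BrokenRepr("ཀ", [[0, 5]])]   # the "tokenize" step itself works fine
try:
    print(tokens)                       # __repr__ only runs when displaying
except IndexError as e:
    print("repr failed:", e)            # → repr failed: string index out of range
```

This is why the error appears at the point where the result is echoed (or printed), not at the `tok.tokenize(input_str)` call.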

mikkokotila commented 6 years ago

I tried installing from the latest master, but the issue is still there.

drupchen commented 6 years ago

I don't seem to be able to reproduce the bug, using the configuration that is in the latest master.

It seems you can't print one of the produced tokens.

Maybe a way to identify it would be to do something like the following:

for num, token in enumerate(tokens):
    print(num)  # to identify which token fails to print
    print(token)  # this calls the __repr__() responsible for the failing `cleaned_content`

Something else: if this code executes without a problem, IPython/Jupyter may have trouble rendering the @property attributes of classes.

Could you elaborate on what you do instead of loading the output of tok.tokenize() into a variable, and how the error only happens then?

mikkokotila commented 6 years ago

Very strange: without actually reinstalling pybo, things work now. In between, #9 came up as a new issue, which was easy to resolve as I mention there. I'm using an env that went through some other changes in the meantime, so it probably had to do with that. I will try to reproduce later with a clean env and report based on that.

drupchen commented 6 years ago

Thank you!