OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

tests failing because of LemmatizeTokens().lemmatize(tokens) #13

Closed: mikkokotila closed this issue 6 years ago

mikkokotila commented 6 years ago

Do you have an idea of why this might be happening?

import pybo as bo

# 1. PREPARATION 

# 1.1. Initializing the tokenizer
tok = bo.BoTokenizer('POS')

# 1.2. Loading in text
input_str = '༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །'

# -------------------------

# 2. CREATING THE OBJECTS 

# 2.1. Creating the pre_processed object
pre_processed = bo.PyBoTextChunks(input_str)

# 2.2. Creating the tokens object
tokens = tok.tokenize(input_str)

The error it throws is this:

Traceback (most recent call last):
  File "./test_script.py", line 23, in <module>
    tokens = tok.tokenize(input_str)
  File "/home/travis/build/mikkokotila/pybo/pybo/__init__.py", line 54, in tokenize
    LemmatizeTokens().lemmatize(tokens)
  File "/home/travis/build/mikkokotila/pybo/pybo/lemmatizetoken.py", line 23, in lemmatize
    if token.unaffixed_word:
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 88, in unaffixed_word
    return self.cleaned_content
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 64, in cleaned_content
    cleaned = '་'.join([''.join([self.content[idx] for idx in syl]) for syl in self.syls])
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 64, in <listcomp>
    cleaned = '་'.join([''.join([self.content[idx] for idx in syl]) for syl in self.syls])
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 64, in <listcomp>
    cleaned = '་'.join([''.join([self.content[idx] for idx in syl]) for syl in self.syls])
IndexError: string index out of range
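
For readers who land on this traceback: the failing expression indexes `self.content` with every index stored in `self.syls`, so any index at or beyond `len(content)` raises exactly this `IndexError`. The sketch below reproduces that mechanism in isolation. The standalone function, the sample string, and the index lists are hypothetical stand-ins mirroring the line shown in the traceback, not pybo's actual `Token` class, and the sketch does not claim to show what produced the bad indices in this run.

```python
# Hypothetical stand-in for the expression on line 64 of token.py in the traceback.
TSEK = '\u0f0b'  # Tibetan syllable delimiter (tsek)


def cleaned_content(content, syls):
    """Rebuild each syllable from its character indices and join them with a tsek."""
    return TSEK.join(''.join(content[idx] for idx in syl) for syl in syls)


def out_of_range(content, syls):
    """Return every syllable index that does not fit inside content."""
    return [idx for syl in syls for idx in syl if not 0 <= idx < len(content)]


# Two syllables, each followed by a tsek: 8 code points in total (assumed sample).
content = '\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b'

good_syls = [[0, 1, 2], [4, 5, 6]]  # indices consistent with content
bad_syls = [[0, 1, 2], [4, 5, 8]]   # 8 points past the last character (index 7)

print(cleaned_content(content, good_syls))
print(out_of_range(content, bad_syls))

try:
    cleaned_content(content, bad_syls)
except IndexError as err:
    print('IndexError:', err)
```

A check like `out_of_range` is only a diagnostic aid: it shows whether the syllable indices and the content string disagree, which is consistent with code and data coming from mismatched versions.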

mikkokotila commented 6 years ago

Ah, this looks like #8. This time it is running on Ubuntu from the Python console (i.e. not in a Notebook). I think the issue might be that my fork is behind the current master; let me check.

mikkokotila commented 6 years ago

Ok, I synced with master and this is resolved. Closing.