OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

tokenizer gives IndexError #18

Closed: mikkokotila closed this issue 6 years ago

mikkokotila commented 6 years ago

The error below comes up for one volume of the Rinchen Terdzo during a scan of the whole body of texts. I tried to reproduce it manually but could not.

The line that is causing it seems to be:

tokens.append(self.chunks_to_token([c_idx]))

in tokenizer.py.

The trace shows that this does not appear to be the same issue as #8.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-221-9b0503e8f2ce> in <module>()
----> 1 word2vec_pipebig('rt_raw_clean.txt')

<ipython-input-218-52a921d05e0c> in word2vec_pipebig(filename, save_model)
    196                 model = word2vec_pipeline(line)
    197             else:
--> 198                 model = word2vec_pipeline(line, build_model=model)
    199             x += 1
    200 

<ipython-input-218-52a921d05e0c> in word2vec_pipeline(docs, epochs, skipgrams, workers, save, from_file, build_model)
    136         docs = read_file(docs)
    137 
--> 138     tokens = tokenize(docs)
    139     sentences = word2vec_prep(tokens)
    140 

<ipython-input-218-52a921d05e0c> in tokenize(text)
     61             pass
     62 
---> 63     return tok.tokenize(out, split_affixes=False)
     64 
     65 

~/dev/astetik_test/lib/python3.6/site-packages/pybo/__init__.py in tokenize(self, string, split_affixes)
     50         """
     51         preprocessed = PyBoTextChunks(string)
---> 52         tokens = self.tok.tokenize(preprocessed, split_affixes=split_affixes)
     53         if self.lemmatize:
     54             LemmatizeTokens().lemmatize(tokens)

~/dev/astetik_test/lib/python3.6/site-packages/pybo/tokenizer.py in tokenize(self, pre_processed, split_affixes, debug)
    135                     current_node = None
    136 
--> 137                 tokens.append(self.chunks_to_token([c_idx]))
    138 
    139             # END OF INPUT

~/dev/astetik_test/lib/python3.6/site-packages/pybo/tokenizer.py in chunks_to_token(self, syls, tag, ttype)
    180         if len(syls) == 1:
    181             # chunk format: ([char_idx1, char_idx2, ...], (type, start_idx, len_idx))
--> 182             token_syls = [self.pre_processed.chunks[syls[0]][0]]
    183             token_type = self.pre_processed.chunks[syls[0]][1][0]
    184             token_start = self.pre_processed.chunks[syls[0]][1][1]

IndexError: list index out of range
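
For context, the last frame suggests that chunks_to_token is being handed a chunk index that falls outside pre_processed.chunks. Below is a minimal sketch of that failure mode, using the chunk format shown in the traceback comment; the data and index are made up for illustration, not taken from pybo's internals:

    # stand-in for PyBoTextChunks.chunks, in the format noted in the traceback:
    # ([char_idx1, char_idx2, ...], (type, start_idx, len_idx))
    chunks = [
        ([0, 1, 2], ('TEXT', 0, 3)),
        ([3, 4], ('TEXT', 3, 2)),
    ]

    c_idx = 2  # one position past the last valid chunk index
    token_syls = [chunks[c_idx][0]]  # IndexError: list index out of range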
mikkokotila commented 6 years ago

FYI I'm handling it for now with this:

    try:
        return tok.tokenize(out, split_affixes=False)
    except IndexError:
        # print the exact string that triggers the IndexError so it can be inspected
        print(out)
        # fall back to a harmless dummy syllable so the run can continue
        return tok.tokenize('སེམས་')

This lets me move forward and also catch the exact string that causes the problem.
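
If it helps, a slight variation of that workaround could append the offending string to a file so it can be attached to this issue later; the filename below is just an example:

    try:
        return tok.tokenize(out, split_affixes=False)
    except IndexError:
        # record the exact input that breaks the tokenizer for later inspection
        with open('failing_inputs.txt', 'a', encoding='utf-8') as f:
            f.write(out + '\n')
        # fall back to a harmless dummy syllable so the run can continue
        return tok.tokenize('སེམས་')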

drupchen commented 6 years ago

I would be interested in having a minimal example of an input string that gives this error. It should be easy since you already do print(out).

drupchen commented 6 years ago

Sorry, I didn't read properly. If you can't reproduce the bug, having at least the volume that triggers it might help in identifying the passage that doesn't get processed correctly.

mikkokotila commented 6 years ago

Yes, I should be able to catch it later today when I run it again.

mikkokotila commented 6 years ago

I was not able to. On the contrary, the same error appeared in different volumes across several runs. I think it has something to do with the way a memory optimizer I use for regular desktop work interferes with Jupyter.