Closed: mikkokotila closed this issue 6 years ago.
FYI I'm handling it for now with this:
try:
    return tok.tokenize(out, split_affixes=False)
except IndexError:
    # Print the input that triggered the error, then fall back to a known-good string
    print(out)
    return tok.tokenize('སེམས་')
So I can move forward and also catch the exact string that causes the problem.
I would be interested in having a minimal example of an input string that gives this error.
It should be easy, since you already do print(out).
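For example, a minimal sketch along these lines (the helper name and log file are hypothetical; it assumes the same tok.tokenize() call as in your workaround) would save any failing input to a file, so the exact string survives the run and can be attached here:

def tokenize_or_log(tok, out, log_path='failing_inputs.txt'):
    try:
        return tok.tokenize(out, split_affixes=False)
    except IndexError:
        # Append the offending string to a log file instead of only
        # printing it to the notebook output, then re-raise.
        with open(log_path, 'a', encoding='utf-8') as f:
            f.write(out + '\n')
        raise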
Sorry, I didn't read properly. If you can't reproduce the bug, having at least the volume that triggers it might help in identifying the passage that doesn't get processed correctly.
Yes, I should be able to catch it later today when I run it again.
I was not able to; instead, the same error appeared in different volumes across several runs. I think it has something to do with the way a memory optimizer I use for regular desktop work interferes with Jupyter.
The error below comes up for one volume in Rinchen Terdzo, in a scan of the whole body of texts. I tried to reproduce it manually but could not.
The line that is causing it seems to be:
tokens.append(self.chunks_to_token([c_idx]))
in tokenizer.py.
The trace shows that this does not appear to be the same issue as #8.