Closed drupchen closed 4 years ago
After investigation, the behaviour appears to be normal, since ཤོག་བཀྲ་ is an existing word (shog bkra/ shog bu tshon khra sna tshogs can/).
The preprocessing correctly parses ཤོག\nབཀྲ་ into two distinct syllables, then checks for the existence of ཤོག་བཀྲ་ in the trie. The \n is kept since it is a transparent character, so in the end ཤོག\nབཀྲ་ gets the OTHER POS tag.
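The following is a minimal sketch of the mechanism described here, not botok's actual code. It assumes a plain set in place of the trie, a greedy longest match over syllables, and a hypothetical `TRANSPARENT` set containing only the newline and the space; the names `syllables` and `tokenize` are made up for illustration.

```python
TSEK = "\u0f0b"            # Tibetan syllable delimiter
TRANSPARENT = {"\n", " "}  # assumption: characters ignored during lookup

def syllables(text):
    """Return (raw, clean) pairs. raw keeps the trailing tsek and any
    transparent characters; clean is the bare syllable used for lookup."""
    out, raw, clean, closed = [], "", "", False
    for ch in text:
        is_sep = ch == TSEK or ch in TRANSPARENT
        if is_sep:
            if clean:
                closed = True          # the current syllable has ended
        elif closed:
            out.append((raw, clean))   # a new syllable starts here
            raw, clean, closed = "", "", False
        raw += ch
        if not is_sep:
            clean += ch
    if raw:
        out.append((raw, clean))
    return out

def tokenize(text, lexicon):
    """Greedy longest match; returns (raw_token, found_in_lexicon) pairs."""
    syls = syllables(text)
    tokens, i = [], 0
    while i < len(syls):
        for j in range(len(syls), i, -1):
            key = TSEK.join(clean for _, clean in syls[i:j])
            if key in lexicon or j == i + 1:
                raw = "".join(r for r, _ in syls[i:j])
                tokens.append((raw, key in lexicon))
                i = j
                break
    return tokens

if __name__ == "__main__":
    lexicon = {"ཤོག" + TSEK + "བཀྲ", "བཀྲ" + TSEK + "ཤིས"}
    for raw, known in tokenize("ཤོག\nབཀྲ" + TSEK + "ཤིས" + TSEK, lexicon):
        print(repr(raw), known)
```

In this sketch the lookup key is built from the clean syllables only, so the line break does not prevent ཤོག་བཀྲ from matching, while the raw token still carries the \n. The remaining ཤིས་ is then left on its own, which mirrors the output reported in the issue.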
Closing the issue since botok works as expected.
gives as output:
where བཀྲ་ ཤིས་ ought to be tokenized as བཀྲ་ཤིས་ on the second line.