OpenPecha / bo_sent_tokenizer

Tibetan sentence tokenizer
MIT License

Feat/bo sen tokenizer #1

Closed tenzin3 closed 2 months ago

tenzin3 commented 2 months ago

Tokenizes Tibetan text into sentences and keeps only the valid ones.

Filtering rules:

- If an invalid token is present: exclude the sentence.
- If another language is present: exclude the sentence.
- If a symbol is present: filter out the symbol and keep the sentence.
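For readers who want to see how the three rules fit together, here is a minimal sketch in Python. It is not the code from this PR. Every name in it (`split_sentences`, `has_other_language`, `strip_symbols`, `filter_sentences`) is hypothetical, and it rests on several assumptions: sentences end at the shad (U+0F0D), tokens are tsheg-separated syllables, "another language" means any letter outside the Tibetan Unicode block (U+0F00 to U+0FFF), and a "symbol" is any other non-Tibetan, non-whitespace character. The PR does not define what makes a token invalid, so that check is left as a caller-supplied predicate.

```python
import re

SHAD = "\u0F0D"   # Tibetan sentence-final mark (assumed sentence boundary)
TSHEG = "\u0F0B"  # Tibetan syllable delimiter (assumed token boundary)


def in_tibetan_block(ch: str) -> bool:
    """True if the character lies in the Tibetan Unicode block."""
    return "\u0F00" <= ch <= "\u0FFF"


def split_sentences(text: str) -> list[str]:
    """Split on shad, keeping the trailing shad with each sentence."""
    pieces = re.findall(rf"[^{SHAD}]+{SHAD}*", text)
    return [p.strip() for p in pieces if p.strip()]


def has_other_language(sentence: str) -> bool:
    """Assumption: any letter outside the Tibetan block is another language."""
    return any(ch.isalpha() and not in_tibetan_block(ch) for ch in sentence)


def strip_symbols(sentence: str) -> str:
    """Keep only Tibetan-block characters and whitespace."""
    return "".join(ch for ch in sentence if in_tibetan_block(ch) or ch.isspace())


def filter_sentences(text: str, is_valid_token=lambda tok: True) -> list[str]:
    """Apply the three rules; is_valid_token is a hypothetical hook."""
    kept = []
    for sentence in split_sentences(text):
        # Rule: another language present -> exclude the sentence.
        if has_other_language(sentence):
            continue
        # Rule: symbol present -> drop the symbol, keep the sentence.
        cleaned = strip_symbols(sentence).strip()
        tokens = [t.strip() for t in cleaned.rstrip(SHAD).split(TSHEG) if t.strip()]
        # Rule: invalid token present -> exclude the sentence.
        if not tokens or not all(is_valid_token(t) for t in tokens):
            continue
        kept.append(cleaned)
    return kept
```

In this sketch the language check runs before symbol stripping, so a sentence containing Latin or other foreign letters is dropped whole instead of being silently reduced to its Tibetan characters. Digits and punctuation outside the Tibetan block are treated as symbols, which may or may not match the behavior intended in the PR.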