MT00016: Sentence segmentation optimisation

tenzin3 commented 1 month ago

Description

Currently, op_mt_tools is used for sentence segmentation. However, op_mt_tools employs botok for word-level tokenization, which is time-consuming. Therefore, there is a need for another script that specifically handles sentence-level segmentation more efficiently.

Completion Criteria

The new sentence segmentation script reaches at least 95% segmentation accuracy when compared against op_mt_tools output.
The new script runs faster than op_mt_tools on TM ids 2912 and 8002.

Implementation Plan

[x] write test cases
[x] script to normalize punctuation and segment by punctuation
[x] script to filter out non-Tibetan text and keep symbols
[x] compare with op_mt_tools in terms of speed
[x] compare with op_mt_tools in terms of segmentation accuracy
[x] modify script to handle exceptions and improve accuracy
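
The punctuation-based steps in the plan could look roughly like the sketch below. It is not the bo_sent_tokenizer implementation: the function names are hypothetical, "Tibetan text" is assumed to mean the Unicode Tibetan block (U+0F00 to U+0FFF), and normalization is assumed to mean collapsing whitespace between consecutive shads.

```python
import re

SHAD = "\u0f0d"  # TIBETAN MARK SHAD

# Assumption: anything outside the Tibetan block (except whitespace) is noise.
# The block also contains Tibetan symbols such as U+0F04 and U+0F05.
NON_TIBETAN = re.compile(r"[^\u0f00-\u0fff\s]+")

# A segment is a run of non-shad characters closed by one or more shads,
# or a trailing run that has no closing shad at all.
SEGMENT = re.compile(rf"[^{SHAD}]*{SHAD}+|[^{SHAD}]+$")


def keep_tibetan(text):
    """Drop characters outside the Tibetan block (hypothetical helper)."""
    return NON_TIBETAN.sub("", text)


def normalize_punctuation(text):
    """Collapse whitespace between consecutive shads (assumed rule)."""
    return re.sub(rf"{SHAD}\s+(?={SHAD})", SHAD, text)


def segment_by_punctuation(text):
    """Split after every run of shads, with no word-level tokenization."""
    cleaned = normalize_punctuation(keep_tibetan(text))
    return [s.strip() for s in SEGMENT.findall(cleaned) if s.strip()]
```

Because this only scans for punctuation, it avoids the word-level tokenization cost, but it also splits an opening mark followed by a double shad into a segment of its own.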

tenzin3 commented 1 month ago

Mismatch in tokenized output:

text = "མངོན་སུམ་ཚད་མས་གྲུབ་པ་འདི་བཞིན་ནོ།།༄༅།།ཡུལ་སྐྱེ་རྒུ་མདོ་ན་མཆིས་པའི་བཙན་པོ་ཁྲི་ལྡེ་སྲོང་བཙན་སྐབས་བརྐོས་པའི་རྡོ་བརྐོས་ཡི་གེར་དཔྱད་པ།"
op_mt_tools tokenizer output = "མངོན་སུམ་ཚད་མས་གྲུབ་པ་འདི་བཞིན་ནོ།།༄༅།།ཡུལ་སྐྱེ་རྒུ་མདོ་ན་མཆིས་པའི་བཙན་པོ་ཁྲི་ལྡེ་སྲོང་བཙན་སྐབས་བརྐོས་པའི་རྡོ་བརྐོས་ཡི་གེར་དཔྱད་པ།\n"
new segmenter output = "མངོན་སུམ་ཚད་མས་གྲུབ་པ་འདི་བཞིན་ནོ།།\n༄༅།།\nཡུལ་སྐྱེ་རྒུ་མདོ་ན་མཆིས་པའི་བཙན་པོ་ཁྲི་ལྡེ་སྲོང་བཙན་སྐབས་བརྐོས་པའི་རྡོ་བརྐོས་ཡི་གེར་དཔྱད་པ།\n"

For id 2912, the new segmenter segments with an error of 0.48 percent relative to op_mt_tools (less than 1 percent).
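
The thread does not say how the error percentage was computed. One plausible way to get a figure of this kind, shown here only as an assumption, is to count the reference sentences (op_mt_tools output) that have no exact match in the new segmenter's output. The function name `mismatch_rate` is hypothetical.

```python
from collections import Counter


def mismatch_rate(reference, candidate):
    """Percentage of reference sentences with no exact match in candidate.

    Assumed metric: multiset comparison, so a sentence that occurs twice in
    the reference must also occur twice in the candidate to count as matched.
    """
    if not reference:
        return 0.0
    missing = Counter(reference) - Counter(candidate)
    return 100.0 * sum(missing.values()) / len(reference)
```

With this definition, the 95% accuracy criterion would correspond to a mismatch rate of 5 percent or less.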

tenzin3 commented 1 month ago

Note: Here bo_sent_tokenizer is the package name, and the time taken is the time needed to segment the text file into sentences.
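
A timing comparison of this kind can be reproduced with a small harness such as the one below. This is a sketch only: `time_segmenter` is a hypothetical helper, and it takes any callable that maps a string to a list of sentences, so it makes no assumption about the actual bo_sent_tokenizer or op_mt_tools function names.

```python
import time


def time_segmenter(segment, text, repeats=3):
    """Return (best wall-clock seconds, last result) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = segment(text)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best, result
```

Running both segmenters through the same harness on the same text file keeps file reading out of the measured time.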

tenzin3 commented 1 month ago

After handling the exceptions, speed decreased slightly.
text = "མངོན་སུམ་ཚད་མས་གྲུབ་པ་འདི་བཞིན་ནོ།།༄༅།།ཡུལ་སྐྱེ་རྒུ་མདོ་ན་མཆིས་པའི་བཙན་པོ་ཁྲི་ལྡེ་སྲོང་བཙན་སྐབས་བརྐོས་པའི་རྡོ་བརྐོས་ཡི་གེར་དཔྱད་པ།"
new segmenter output = "མངོན་སུམ་ཚད་མས་གྲུབ་པ་འདི་བཞིན་ནོ།།\n༄༅།།ཡུལ་སྐྱེ་རྒུ་མདོ་ན་མཆིས་པའི་བཙན་པོ་ཁྲི་ལྡེ་སྲོང་བཙན་སྐབས་བརྐོས་པའི་རྡོ་བརྐོས་ཡི་གེར་དཔྱད་པ།\n"
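
One way to get the behaviour shown above, where the opening mark stays attached to the sentence that follows it, is a merge pass after the punctuation split. The sketch below is an assumption about the approach, not the package's code; `segment_with_openers` and the `OPENER_ONLY` pattern are hypothetical names.

```python
import re

SHAD = "\u0f0d"          # TIBETAN MARK SHAD
YIG_MGO = "\u0f04\u0f05"  # initial and closing yig mgo head marks

SEGMENT = re.compile(rf"[^{SHAD}]*{SHAD}+|[^{SHAD}]+$")
# A piece made only of head marks and shads is an opener, not a sentence.
OPENER_ONLY = re.compile(rf"[{YIG_MGO}]+{SHAD}*")


def segment_with_openers(text):
    """Split after shad runs, then glue opener-only pieces to the next one."""
    merged = []
    pending = ""
    for piece in SEGMENT.findall(text):
        if OPENER_ONLY.fullmatch(piece):
            pending += piece  # hold the opener until a real sentence arrives
            continue
        merged.append(pending + piece)
        pending = ""
    if pending:  # opener at the very end of the text
        merged.append(pending)
    return merged
```

Joining the result with "\n" and appending a final "\n" gives the same output shape as in the comment above. The extra pass over the pieces is consistent with the small slowdown reported.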

OpenPecha / bo_sent_tokenizer