STT0027: script to adjust the existence of shed in transcript inference.

gangagyatso4364 commented 2 weeks ago

Description

Write a script to normalize the placement of shed (།) in the inference transcripts in STT Pecha Tools. Each shed should always be followed by a space, and a doubled shed should collapse into a single one. This will significantly reduce annotator time. Shed occurrence cases:

inf_text = 'ཨ་མ་ལགས་འདི་ཨ་མ་ལགས་ཀི་ལུང་པ་དེ་ག་འདྲའི་འདྲ་བོ་ཅིག་ཡོད་རེད་ཟེ། ལུང་པ། ། ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།'
adjusted_inf_text = 'ཨ་མ་ལགས་འདི་ཨ་མ་ལགས་ཀི་ལུང་པ་དེ་ག་འདྲའི་འདྲ་བོ་ཅིག་ཡོད་རེད་ཟེ། ལུང་པ། ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།'

inf_text = 'ཨ་མ་ལགས་འདི་ཨ་མ་ལགས་ཀི་ལུང་པ་དེ་ག་འདྲའི་འདྲ་བོ་ཅིག་ཡོད་རེད་ཟེ། ལུང་པ།།ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།'
adjusted_inf_text = 'ཨ་མ་ལགས་འདི་ཨ་མ་ལགས་ཀི་ལུང་པ་དེ་ག་འདྲའི་འདྲ་བོ་ཅིག་ཡོད་རེད་ཟེ། ལུང་པ། ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།'

inf_text = 'ཨ་མ་ལགས་འདི་ཨ་མ་ལགས་ཀི་ལུང་པ་དེ་ག་འདྲའི་འདྲ་བོ་ཅིག་ཡོད་རེད་ཟེ།ལུང་པ། །ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།'
adjusted_inf_text = 'ཨ་མ་ལགས་འདི་ཨ་མ་ལགས་ཀི་ལུང་པ་དེ་ག་འདྲའི་འདྲ་བོ་ཅིག་ཡོད་རེད་ཟེ། ལུང་པ། ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།'

Refer to the clean transcription function in: https://github.com/OpenPecha/stt-combine-datasets/blob/main/04_combine_all.ipynb

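
One possible way to cover the three cases above is a single regular expression that collapses any run of sheds (with or without whitespace between them) into one shed followed by one space. This is only a minimal sketch and is not taken from the referenced notebook: the function name `adjust_shed` is hypothetical, and it assumes the shed in the transcripts is the Unicode character U+0F0D.

```python
import re

# Assumption: the shed in the inference text is U+0F0D (TIBETAN MARK SHAD).
SHED = "\u0f0d"

# A shed, any further sheds (optionally separated by whitespace),
# and whatever whitespace follows the run.
_SHED_RUN = re.compile(rf"{SHED}(?:\s*{SHED})*\s*")


def adjust_shed(inf_text: str) -> str:
    """Return inf_text with every run of sheds replaced by one shed plus one space.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    adjusted = _SHED_RUN.sub(SHED + " ", inf_text)
    # The sentence-final shed should not keep the space added above.
    return adjusted.rstrip()


if __name__ == "__main__":
    # Doubled shed with no space after it, as in the second case above.
    print(adjust_shed("ལུང་པ།།ལུང་པ་ལུང་ཚ་སྤོབས་པ་རེད།"))
```

Note that this collapses every doubled shed, which matches the expected outputs above; if some transcripts need a double shed preserved, the pattern would have to be narrowed.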
Completion Criteria

A script that handles all three cases above.

Implementation Plan

Subtasks

[x] Write a script that handles each of the cases above.
[x] Look for any other cases of misplaced shed as well.
[x] Run a test on a few of the misplaced-shed cases.
[x] Include it in the stt-split-audio script.
[x] Run the script on all the inference transcripts in split audio when uploading new data to STT Pecha Tools.

gangagyatso4364 commented 2 weeks ago

Refer to stt-combine-all for the text cleaning and normalization script.

gangagyatso4364 commented 2 weeks ago

Update stt-split-audio with the cleaned inference text.

OpenPecha / stt-split-audio

STT0027: script to adjust the existence of shed in transcript inference. #2
