OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFC0116]: Filter out bad transcriptions for STT_MV #399

Open gangagyatso4364 opened 5 months ago

gangagyatso4364 commented 5 months ago

RFC0116: Filter out bad transcriptions for STT_MV

Named Concepts

STT_MV: Speech-to-text for Tibetan movie/TV program transcriptions.
CER: Character Error Rate, a metric for transcription quality.
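For reference, CER is commonly computed as the character-level edit distance between a reference and a hypothesis transcription, divided by the length of the reference. The sketch below illustrates that common definition; the function names are hypothetical, and it counts Unicode code points as characters, which is an assumption (a Tibetan stack spans several code points, so a project may prefer to count differently).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A CER of 0.0 means the hypothesis matches the reference exactly; values can exceed 1.0 when the hypothesis is much longer than the reference.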

Summary

Develop a script that filters out poor-quality transcriptions of Tibetan movie and TV program segments. Training the STT model only on high-quality STT data should improve its performance.

Dependencies

botok

Infrastructures

Design Illustrations

[Image: Untitled (5)]

Justification

This approach uses botok to analyse the quality of the transcription text, which is more efficient than manually checking the spelling in the text.
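The RFC does not spell out how botok's output is turned into a keep/drop decision, so the following is only a minimal sketch of one possible rule: drop a transcription when too many of its syllables are not recognised. To stay self-contained it does not call botok; it splits on the tsheg and looks syllables up in a small hypothetical lexicon. In an actual script, botok's tokenizer would presumably take the place of `syllables` and the lexicon lookup. The names `unknown_ratio` and `filter_transcriptions` and the 0.2 threshold are all assumptions.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause/sentence delimiter


def syllables(text):
    """Split text into syllables on tsheg, treating shad as whitespace."""
    parts = []
    for chunk in text.replace(SHAD, " ").split():
        parts.extend(s for s in chunk.split(TSHEG) if s)
    return parts


def unknown_ratio(text, known):
    """Fraction of syllables not found in the `known` lexicon.

    Text with no syllables at all is treated as fully unknown (1.0).
    """
    syls = syllables(text)
    if not syls:
        return 1.0
    unknown = sum(1 for s in syls if s not in known)
    return unknown / len(syls)


def filter_transcriptions(rows, known, max_unknown_ratio=0.2):
    """Partition transcriptions into (kept, dropped) by unknown-syllable ratio."""
    kept, dropped = [], []
    for row in rows:
        if unknown_ratio(row, known) <= max_unknown_ratio:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped
```

Returning the dropped rows alongside the kept ones makes it easy to spot-check what the filter removes before the threshold is fixed.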

Testing

  1. Unit tests for each script component.
  2. Evaluation using a subset of the dataset to assess filtering accuracy.
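The evaluation in step 2 could be sketched as follows, assuming a subset of segments has been manually labelled as good or bad. The function name, the boolean label format, and the choice of precision, recall, and accuracy as metrics are all assumptions, not something the RFC specifies.

```python
def filtering_metrics(predicted_bad, actual_bad):
    """Compare filter decisions against manual labels.

    Both arguments are equal-length sequences of booleans, where True
    means "this transcription is bad and should be filtered out".
    """
    if len(predicted_bad) != len(actual_bad):
        raise ValueError("inputs must have the same length")
    if not predicted_bad:
        raise ValueError("inputs must be non-empty")

    pairs = list(zip(predicted_bad, actual_bad))
    tp = sum(1 for p, a in pairs if p and a)      # bad, correctly dropped
    fp = sum(1 for p, a in pairs if p and not a)  # good, wrongly dropped
    fn = sum(1 for p, a in pairs if not p and a)  # bad, wrongly kept
    correct = sum(1 for p, a in pairs if p == a)

    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": correct / len(pairs),
    }
```

Low precision means good training data is being thrown away; low recall means bad transcriptions are still reaching the STT model.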

Implementation Steps

List all the steps involved during implementation.

Reviewed By