STT0060: Sebastian STT data conversion to training data

Description:

We need to convert Sebastian’s data into the CSV format used for training a speech-to-text (STT) model. The input consists of zip-compressed audio files, with the corresponding transcripts provided in 'label.txt' (Kaldi format). Each audio segment has associated metadata: speaker ID, session ID, and audio order ID. The goal is a CSV file with the fields file name, department, audio URL, audio duration, and transcript. Related: sebastian data utils.

Completion Criteria:

Data Extraction: Extract the audio files (11,592) from the zip archive and retrieve the necessary metadata from the directory structure and label.txt.
CSV Creation: Convert the extracted data (11,592) into the following format:

file_name, dept, audio_url, audio_duration_in_seconds, transcript
STT_NW_S_[speakerid]_[sessionid]_[audio_order_id]_[start]_to_[end], NW, https://example.com/wav16k/STT_NW_S_[speakerid]_[sessionid]_[audio_order_id]_[start]_to_[end].wav, [duration], [transcript]

Column Definitions:
- file_name: Formatted as STT_NW_S_[speakerid]_[sessionid]_[audio_order_id]_[start]_to_[end]
- dept: Set to "NW" (indicating the department as per the STT NW project)
- audio_url: A constructed URL pointing to the audio file, which is stored in the S3 location s3://monlam.ai.stt/wav16k/
- audio_duration_in_seconds: The duration of each audio segment (retrieved from the audio files).
- transcript: The corresponding transcription from the label.txt file.
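
The naming rules above can be sketched as two small helpers. This is only an illustration: the function names are hypothetical, the base URL is the one used in the examples in this issue, and the end time is assumed to be the duration rounded to whole milliseconds with a start time of 0.

```python
# Base URL taken from the examples in this issue; adjust if the files are served elsewhere.
BASE_URL = "https://d38pmlk0v88drf.cloudfront.net/wav16k/"


def build_file_name(speaker_id: str, session_id: str, audio_order_id: str,
                    duration_seconds: float) -> str:
    """Build file_name; start is assumed to be 0 ms, end is the duration in ms."""
    end_ms = round(duration_seconds * 1000)
    return f"STT_NW_S_{speaker_id}_{session_id}_{audio_order_id}_0_to_{end_ms}"


def build_audio_url(file_name: str, base_url: str = BASE_URL) -> str:
    """Build audio_url by appending file_name plus the .wav extension to the base URL."""
    return f"{base_url}{file_name}.wav"


name = build_file_name("006", "01", "144", 7.53)
print(name)                   # STT_NW_S_006_01_144_0_to_7530
print(build_audio_url(name))  # https://d38pmlk0v88drf.cloudfront.net/wav16k/STT_NW_S_006_01_144_0_to_7530.wav
```

The IDs are passed as strings so that zero padding such as "006" and "01" is preserved.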

Example Subtasks:

Data Extraction:
- Extract the compressed audio files from the zip archive and place them in the appropriate directories.
- Extract the metadata from the label.txt file to obtain the transcript, speaker ID, session ID, and audio order ID.
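
A minimal sketch of the extraction step, assuming label.txt follows the usual Kaldi text layout (one utterance per line: an utterance ID, whitespace, then the transcript). The function names are hypothetical. How the utterance ID and directory names map to speaker ID, session ID, and audio order ID is not specified in this issue, so that step is left out.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Unzip the archive into out_dir and return the extracted .wav paths, sorted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return sorted(out_dir.rglob("*.wav"))


def parse_label_file(label_path):
    """Map utterance ID to transcript, assuming '<utt_id> <transcript>' per line."""
    transcripts = {}
    with open(label_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            utt_id = parts[0]
            transcripts[utt_id] = parts[1] if len(parts) > 1 else ""
    return transcripts
```

Comparing the number of extracted files and the number of transcripts against 11,592 gives a quick check that nothing was lost.
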
Audio File Processing:
- Calculate the start and end times for each audio segment. Assume the start time is 0 milliseconds, and set the end time to the audio file duration in milliseconds (for example, a 7.53-second file gives 0_to_7530).
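
One way to get the duration and segment bounds, sketched with Python's standard `wave` module. This assumes the extracted files are uncompressed PCM WAV, which `wave` can read; other encodings would need a different reader. The function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def segment_bounds_ms(duration_seconds):
    """Return (start_ms, end_ms); start is assumed to be 0, end is the rounded duration."""
    return 0, round(duration_seconds * 1000)
```

For a 7.53-second file, `segment_bounds_ms(7.53)` returns `(0, 7530)`.
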
CSV Generation:
- For each audio file, generate a row in the CSV file using the following format:
  - file_name: STT_NW_S_[speakerid]_[sessionid]_[audio_order_id]_[start]_to_[end] (without the .wav extension)
  - audio_url: Generated by appending the file_name and the .wav extension to the base URL (e.g., https://d38pmlk0v88drf.cloudfront.net/wav16k/)
  - audio_duration_in_seconds: Extracted from the audio files.
  - transcript: Pulled from label.txt.
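
A minimal sketch of the CSV writing step using the standard `csv` module. The function name and the dict-per-row input shape are assumptions made for illustration; the column names are the ones defined in this issue.

```python
import csv

FIELDS = ["file_name", "dept", "audio_url", "audio_duration_in_seconds", "transcript"]


def write_training_csv(rows, csv_path):
    """Write an iterable of dicts keyed by FIELDS to csv_path; return the row count."""
    rows = list(rows)
    # newline="" lets the csv module control line endings; UTF-8 keeps Tibetan text intact.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Using `csv.DictWriter` rather than joining strings by hand means a transcript that happens to contain a comma or a quote is escaped correctly.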

Example:

file_name, dept, audio_url, audio_duration_in_seconds, transcript
STT_NW_S_006_01_144_0_to_3000, NW, https://d38pmlk0v88drf.cloudfront.net/wav16k/STT_NW_S_006_01_144_0_to_3000.wav, 3, ཀྲུང་གོའི་ཕྱོགས་ཀྱིས་ཅོར་ཏན་གྱིས་ གླེང་སྟེགས་ཀྱི་ཐུན་མོང་ཀྲུའུ་ཞིའི་རྒྱལ་ཁབ་ཡིན་པའི་ཆ་ནས་མཐུན་སྦྱོར་བྱ་བ་མང་བོ་བསྒྲུབ་པར་གཟེངས་བསྟོད་བྱེད་པ་དང་།

CSV File for Tibetan Speech Corpus:

file_name,dept,audio_url,audio_duration_in_seconds,transcript
STT_NW_S_006_01_144_0_to_7530,NW,https://d38pmlk0v88drf.cloudfront.net/wav16k/STT_NW_S_006_01_144_0_to_7530.wav,7.53,ཀྲུང་གོའི་ཕྱོགས་ཀྱིས་ཅོར་ཏན་གྱིས་གླེང་སྟེགས་ཀྱི་ཐུན་མོང་ཀྲུའུ་ཞིའི་རྒྱལ་ཁབ་ཡིན་པའི་ཆ་ནས་མཐུན་སྦྱོར་བྱ་བ་མང་བོ་བསྒྲུབ་པར་གཟེངས་བསྟོད་བྱེད་པ་དང་།
STT_NW_S_006_01_171_0_to_6360,NW,https://d38pmlk0v88drf.cloudfront.net/wav16k/STT_NW_S_006_01_171_0_to_6360.wav,6.36,ཀྲུང་གོའི་ཕྱོགས་ཀྱིས་ཨུ་རུ་སི་དཀར་པོའི་རང་བདག་གདམ་གསེས་ཀྱི་འཕེལ་རྒྱས་བགྲོད་ལམ་ལ་ནམ་རྒྱུན་བརྩི་འཇོག་བྱས་པ་དང་།
STT_NW_S_006_01_180_0_to_5120,NW,https://d38pmlk0v88drf.cloudfront.net/wav16k/STT_NW_S_006_01_180_0_to_5120.wav,5.12,ཀྲུང་གོའི་འཕེལ་རྒྱས་ཀྱི་མྱུར་ཚད་དང་ལས་སྒྲུབ་ལས་ཆོད་ཀྱིས་སྣང་བརྙན་ཟབ་མོ་བསྐྲུན་ཡོད་པ་དང་།
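
Rows like these can be sanity-checked mechanically before a manual review. The sketch below is one possible set of checks, not an agreed validation spec: it assumes the start time is always 0, that the end time equals the duration in whole milliseconds, and that the URL ends with the file name plus `.wav`. The function name and messages are hypothetical.

```python
import re

# ID segments are matched loosely (anything without an underscore); start/end must be digits.
FILE_NAME_PATTERN = re.compile(r"^STT_NW_S_([^_]+)_([^_]+)_([^_]+)_(\d+)_to_(\d+)$")


def check_row(row):
    """Return a list of problems found in one CSV row (an empty list means it looks fine)."""
    problems = []
    match = FILE_NAME_PATTERN.match(row["file_name"])
    if match is None:
        problems.append("file_name does not match the expected pattern")
    else:
        start_ms, end_ms = int(match.group(4)), int(match.group(5))
        if start_ms != 0:
            problems.append("start time is not 0")
        expected_end = round(float(row["audio_duration_in_seconds"]) * 1000)
        if end_ms != expected_end:
            problems.append("end time does not match the duration")
    if row["dept"] != "NW":
        problems.append("dept is not NW")
    if not row["audio_url"].endswith("/" + row["file_name"] + ".wav"):
        problems.append("audio_url does not end with file_name + .wav")
    return problems
```

Running `check_row` over each row read with `csv.DictReader` and printing the non-empty results gives a short list of rows that need a closer look.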

@gangagyatso4364, could you please review this CSV file sample?

OpenPecha / formatting-tib-public-speech-corpus