README

STT Combine Datasets

Owner(s)

RFXs

Requests for work (RFWs) and requests for comments (RFCs) associated with this project:

RFW0114: Recreate the benchmark dataset, ensuring a more balanced distribution across all departments
RFW0125: Technical documentation about STT and TTS workflow for other developers

Project description • Who this project is for • Project dependencies • Instructions for use • Contributing guidelines • Additional documentation • How to get help • Terms of use

Project description

STT Combine Datasets combines speech-to-text (STT) data from three sources: stt.pecha.tools, a legacy Prodigy instance, and SayMore for STT Movie (MV). The data from all three sources is merged into a single file, 04_combine_all.tsv, and the benchmark dataset is created from this combined TSV.
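
As a rough illustration of the combined file's format, the sketch below reads tab-separated text and counts rows per source using only the Python standard library. The column names (file_name, transcript, source) and the sample rows are hypothetical placeholders, not taken from the real file; check the actual header of 04_combine_all.tsv before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for 04_combine_all.tsv.
# The real column names and values may differ.
SAMPLE_TSV = (
    "file_name\ttranscript\tsource\n"
    "a_0001.wav\ttext one\tstt.pecha.tools\n"
    "b_0001.wav\ttext two\tprodigy\n"
    "c_0001.wav\ttext three\tsaymore\n"
    "a_0002.wav\ttext four\tstt.pecha.tools\n"
)


def count_rows_per_source(tsv_file):
    """Count rows per value of the (assumed) 'source' column."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["source"] for row in reader)


counts = count_rows_per_source(io.StringIO(SAMPLE_TSV))
print(dict(counts))
```

To try this on the real file, pass a file object opened with open("04_combine_all.tsv", newline="", encoding="utf-8") instead of the in-memory sample.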

Who this project is for

This project is intended for STT data pipeline maintainers who want to update the aggregate STT/TTS dataset.

Project dependencies

Before using STT Combine Datasets, ensure you have:

saymore-report-generator

Instructions for use

Get started with STT Combine Datasets by running the following notebooks in order:

01_stt_pecha_tools.ipynb
- Collect data from stt.pecha.tools.
02_prodigy.ipynb
- Collect data from the legacy Prodigy instance.
03_mv_saymore.ipynb
- Collect data from STT Movie (MV). The Movie data is collected from a separate repository on the MonlamAI GitHub.
04_combine_all.ipynb
- Combine all the data into one dataframe.
05_benchmark.ipynb
- Create the benchmark dataset from tasks with the highest quality grade, 3, which means the task has been checked by the quality control team.
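
The combine and benchmark steps can be pictured roughly as follows. This is a minimal sketch, not the notebooks' actual code: the notebooks work with dataframes, whereas this example uses plain dictionaries so that it runs without extra packages. The field names (file_name, grade, source), the helper functions, and the sample records are all hypothetical.

```python
def tag_source(rows, source):
    """Return copies of the rows with a 'source' field added."""
    return [{**row, "source": source} for row in rows]


def combine_all(pecha_rows, prodigy_rows, saymore_rows):
    """Concatenate the three collections into one list of rows."""
    return (
        tag_source(pecha_rows, "stt.pecha.tools")
        + tag_source(prodigy_rows, "prodigy")
        + tag_source(saymore_rows, "saymore")
    )


def select_benchmark(rows, top_grade=3):
    """Keep only rows whose (assumed) 'grade' field equals the top grade."""
    return [row for row in rows if row.get("grade") == top_grade]


# Hypothetical records standing in for the output of notebooks 01 to 03.
pecha = [
    {"file_name": "a_0001.wav", "grade": 3},
    {"file_name": "a_0002.wav", "grade": 1},
]
prodigy = [{"file_name": "b_0001.wav", "grade": 2}]
saymore = [{"file_name": "c_0001.wav", "grade": 3}]

combined = combine_all(pecha, prodigy, saymore)
benchmark = select_benchmark(combined)
print(len(combined), [row["file_name"] for row in benchmark])
```

Note that grades read back from a TSV file arrive as strings, so a real pipeline would need to convert them to integers (or compare against "3") before filtering.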

Install STT Combine Datasets

1. Clone the repository

    git clone git@github.com:OpenPecha/stt-combine-datasets.git

2. Set up a virtual environment

    a. Create the virtual environment

    python3 -m venv .env

    b. Activate the environment

    source .env/bin/activate

3. Install the required packages

    pip install -r requirements.txt
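
Before opening the notebooks, it can help to confirm that Python is actually running inside the activated environment. The check below is a small optional sketch and not part of the repository; the helper name in_virtualenv is made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this reports False after activation, the shell is probably still picking up the system interpreter rather than the one in .env.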

Contributing guidelines

If you'd like to help out, check out our contributing guidelines.

Additional documentation

For more information:

How to get help

File an issue.
Email us at openpecha[at]gmail.com.
Join our discord.

Terms of use

STT Combine Datasets is licensed under the MIT License.
