For a quick overview of the project, please watch the following video:
Prior to running the code, you will need to set up the following services:
aiXplain: The ASR models used for transcribing audio are accessed from the aiXplain platform, which is an essential component of this repo. aiXplain provides easy-to-use no-code AI/ML solutions that can be integrated into applications such as this one with a single API call.
To use the aiXplain tools, you first need to create an account on the aiXplain platform. Then you can choose from the plethora of models to use directly, or create pipelines that run those models in a cascade. Trying or deploying those models requires credits, which may be easily purchased from the platform.
After setting up, you need to generate a private TEAM_API_KEY from the integrations settings. Store it safely, as it will be used by the aiXplain SDK to securely access your account, models, and pipelines.
The following short YouTube videos explain the aiXplain platform and how to use it:
AWS S3 bucket: This is used to temporarily store data while it is being processed by the pipeline.
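Because the env configuration below requires the bucket to be public, each uploaded object is reachable at a predictable HTTPS URL. A minimal sketch of building that URL (the bucket name, region, and key below are purely illustrative, not values from this repo):

```python
from urllib.parse import quote

def s3_public_url(bucket: str, region: str, key: str) -> str:
    # Virtual-hosted-style URL for an object in a public S3 bucket.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"

print(s3_public_url("my-tts-bucket", "us-east-1", "datasets/audio 01.wav"))
```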
After both are set up, you should enter the relevant information and credentials in the environment files:
Configure the environment file: `vars.env`. Open the `vars.env` file and add the following environment variables:

- `S3_BUCKET_NAME`: The name of the S3 bucket where the video and subtitles will be stored. THE BUCKET MUST BE PUBLIC.
- `S3_DATASET_DIR`: The folder path inside the S3 bucket where the video will be stored.
- `AWS_ACCESS_KEY_ID`: Your AWS access key ID.
- `AWS_SECRET_ACCESS_KEY`: Your AWS secret access key.
- `AWS_DEFAULT_REGION`: The AWS region where the S3 bucket is located.

Configure the environment file: `secrets.env`. Open the `secrets.env.example` file and save it as `secrets.env`. Then, add the following environment variables:

- `TEAM_API_KEY`: aiXplain platform API key (generated on the aiXplain platform under Team Settings > Integrations).
- `HUGGINGFACE_TOKEN`: Hugging Face token for using the models (available from the Hugging Face website after signing up).

To start the PostgreSQL database and the Redis database, run the following command:
```shell
bash docker-compose.sh start
```
This will create docker containers for each.
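Both env files are plain KEY=VALUE lists. If you ever need to read them from a Python script without extra dependencies, here is a minimal parser sketch (the quoting rules are an assumption for illustration, not the repo's actual loader):

```python
def load_env_file(path: str) -> dict:
    """Parse a simple KEY=VALUE env file; blank lines and # comments are skipped."""
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip("'\"")
    return env
```

For example, `os.environ.update(load_env_file("vars.env"))` makes the variables visible to the current process.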
Use the following command to start Celery, which is used to schedule tasks asynchronously.

```shell
celery -A src.service.tasks worker --loglevel=info --pool=threads
```
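Conceptually, the Celery worker pulls jobs off a queue and runs them in the background while the API process keeps serving requests. A toy stdlib-only illustration of that worker pattern, not the repo's actual task code:

```python
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    # Pull jobs off the queue until the None sentinel arrives.
    while True:
        job = tasks.get()
        if job is None:
            break
        results.append(job())  # "process" the task

t = threading.Thread(target=worker)
t.start()
tasks.put(lambda: "trimmed audio")  # schedule a task asynchronously
tasks.put(None)                     # tell the worker to shut down
t.join()
print(results)  # ['trimmed audio']
```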
The following command starts the backend service that handles the data processing.

```shell
uvicorn src.service.api:app --port 8089 --reload
```
Please note that the previous services need to be running properly for the web app to work.
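One quick way to verify the backend is answering before launching the web apps; the port matches the uvicorn command above, while probing the root path is just an illustration:

```python
import urllib.request
import urllib.error

def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, regardless of status code."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # the server responded (e.g. 404), so it is running
    except (urllib.error.URLError, OSError):
        return False

print("backend:", "up" if is_up("http://127.0.0.1:8089/") else "down")
```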
You may use the following command to run the annotator app while in the project root directory.
```shell
python -m streamlit run --server.port 8501 ./src/web_app/annotator/🏠_Intro_annotator.py
```
You may use the following command to run the admin app while in the project root directory.
```shell
python -m streamlit run --server.port 8502 --server.maxUploadSize 8192 ./tts-qa/src/web_app/admin/🏠_Intro_admin.py
```
You may use any open ports of your choice.
You can upload a CSV file containing the text and a ZIP file containing the recordings. An example file may be downloaded from the frontend to see the expected format. You may also extract the start and end IDs of the recordings from the file names by providing a regex filter. After uploading the corresponding files, the processing will start. Once the initial processing ends (visible through Celery), you will need to start the trimming script using:
```shell
python ./tts-qa/src/utils/trim_asr.py
```
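As an example of the regex filter mentioned above, here is how start and end IDs could be pulled out of recording file names (the file names and pattern below are hypothetical; adapt them to your own naming scheme):

```python
import re

# Hypothetical recording file names; the numeric part is the recording ID.
filenames = ["recording_00012.wav", "recording_00047.wav", "recording_00033.wav"]
pattern = re.compile(r"recording_(\d+)\.wav")

ids = [int(pattern.search(name).group(1)) for name in filenames]
start_id, end_id = min(ids), max(ids)
print(start_id, end_id)  # 12 47
```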
The following command generates a backup dump of the database, named with the current timestamp, which you may save to S3.
```shell
docker exec -t postgres_container_dev pg_dump -U postgres dev_tts_db > dump_`date +%Y-%m-%d"_"%H_%M_%S`.sql
```
To restore the database from a dump, run:

```shell
cat dump_2023-08-08_10_16_24.sql | docker exec -i postgres_container_dev psql -U postgres dev_tts_db
```
The commands above target `postgres_container_dev`; replace `dev` with `prod` for the production container.
Here are some queries you can run to get insights about the data. First, define a `ROUND(float, int)` overload, since PostgreSQL's built-in two-argument `ROUND` only accepts `numeric` values:
```sql
CREATE FUNCTION ROUND(float,int) RETURNS NUMERIC AS $f$
    SELECT ROUND( CAST($1 AS numeric), $2 )
$f$ language SQL IMMUTABLE;
```
Sum of the duration of all the samples for each dataset (excluding the English and German datasets):

```sql
SELECT dataset.name AS dataset_name, ROUND(SUM(sample.trimmed_audio_duration) / 60, 2) AS minutes, ROUND(SUM(sample.trimmed_audio_duration) / 3600, 2) AS hours
FROM sample
JOIN dataset ON sample.dataset_id = dataset.id
WHERE dataset.name NOT LIKE '%English%' AND dataset.name NOT LIKE '%German%'
GROUP BY dataset.name
ORDER BY dataset.name;
```
Sum of the duration of all the samples in a dataset
```sql
SELECT SUM(sample.trimmed_audio_duration) / 60 / 60 AS duration_after_trimming
FROM sample
JOIN dataset ON sample.dataset_id = dataset.id
WHERE dataset.name LIKE '%' || 'English' || '%';
```
Sum of the duration of all the samples for each dataset.language
```sql
SELECT dataset.language AS dataset_name, ROUND(SUM(sample.trimmed_audio_duration) / 60, 2) AS minutes, ROUND(SUM(sample.trimmed_audio_duration) / 3600, 2) AS hours
FROM sample
JOIN dataset ON sample.dataset_id = dataset.id
WHERE dataset.name NOT LIKE '%English (A%'
GROUP BY dataset.language
ORDER BY dataset.language;
```