OpenPecha / stt-wav2vec2


STT00056: Finetune with Dalai Lama STT dataset (MM24) #7

Open gangagyatso4364 opened 2 months ago

gangagyatso4364 commented 2 months ago

Description

We aim to enhance our speech-to-text (STT) model by fine-tuning it using exclusive speaker-specific data combined with our existing base training data. We will use Low-Rank Adaptation (LoRA), a method designed for efficient fine-tuning of large models with minimal computational overhead. This approach will enable the existing model to adapt effectively to the nuances of the speaker's voice while preserving the general knowledge acquired from the base data. The goal is to evaluate and compare the model's performance on the speaker’s test data before and after fine-tuning with LoRA, demonstrating the potential gains in accuracy and robustness.
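
As a minimal sketch of how LoRA adapters might be attached to a wav2vec2 CTC model (assuming the Hugging Face `peft` library; the checkpoint name and LoRA hyperparameters below are illustrative placeholders, not the project's actual values):

```python
# Minimal sketch: attach LoRA adapters to a wav2vec2 CTC model with peft.
# The checkpoint name and hyperparameters are illustrative placeholders.
from transformers import Wav2Vec2ForCTC
from peft import LoraConfig, get_peft_model

BASE_CHECKPOINT = "your-org/your-wav2vec2-stt-checkpoint"  # placeholder

model = Wav2Vec2ForCTC.from_pretrained(BASE_CHECKPOINT)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections in each encoder layer
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Because only the adapter matrices receive gradients, the base weights stay frozen, which is what lets the model adapt to the speaker without forgetting the base training data.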

Objective: Evaluate and compare the model's performance on the speaker's test data before and after LoRA fine-tuning.

Completion Criteria

  1. A complete pipeline that incorporates LoRA for fine-tuning the model using speaker-specific data.
  2. Performance evaluation of the LoRA-fine-tuned model on speaker-specific test data, compared against the baseline model (see the sketch after this list).
  3. Documentation of the potential improvements and scalability of the model with future acquisitions of speaker data.

Implementation

(Implementation diagram)

Subtasks

A subtask list for enhancing the speech-to-text model using LoRA:

gangagyatso4364 commented 2 months ago

The Dalai Lama training data has been extracted to: s3://monlam.ai.stt/TTS_speakers/dalai_lama.csv
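
For reference, a sketch of reading that manifest (assuming `s3fs` is installed and AWS credentials with read access to the bucket are configured):

```python
# Sketch: load the extracted training manifest directly from S3.
# Assumes s3fs is installed and AWS credentials grant read access to the bucket.
import pandas as pd

df = pd.read_csv("s3://monlam.ai.stt/TTS_speakers/dalai_lama.csv")
print(df.shape)
print(df.head())
```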

gangagyatso4364 commented 1 month ago

Yash will migrate the workspace to the US by end of day; then we can start running the instance for model training.

gangagyatso4364 commented 1 month ago

Situ Rinpoche data is being fed into stt.pecha.tools for transcription.

gangagyatso4364 commented 1 month ago

Comparison of the model before and after LoRA fine-tuning:

The Character Error Rate (CER) measures transcription accuracy as the ratio of character-level substitutions (S), deletions (D), and insertions (I) to the total number of characters (N) in the reference text: CER = (S + D + I) / N. For example, 5 substitutions, 3 deletions, and 2 insertions against a 100-character reference give CER = 10/100 = 10%.
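
As a concrete usage example, CER can be computed with the `jiwer` library (the transcript strings below are hypothetical):

```python
# Sketch: character error rate between a reference and a hypothesis with jiwer.
# The Tibetan strings are hypothetical examples, not project data.
import jiwer

reference = "བཀྲ་ཤིས་བདེ་ལེགས"   # hypothetical ground-truth transcript
hypothesis = "བཀྲ་ཤས་བདེ་ལེགས"   # hypothetical model output
print(jiwer.cer(reference, hypothesis))
```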

  1. Base model CER on Dalai Lama test data: 10.36%
  2. LoRA fine-tuned model CER on Dalai Lama test data: