STT0055: Plan the workflow for Situ Rinpoche audio transcription and model training.

Description

We have received exclusive data for speech-to-text (STT) from a specific speaker. The task is to fine-tune the model using both this speaker's training data and the base training data, and then evaluate the model's performance on the speaker's test data. Additionally, compare the performance of the model trained solely on the base data versus the model trained on both the base data and the speaker-specific data. Finally, estimate how the model's performance might improve if we acquire more data from the speaker in the future.
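One possible way to approach the last point (estimating the gain from more speaker data) is to train on growing subsets of the speaker data, measure the error rate on the fixed speaker test set each time, and fit a simple learning curve. The sketch below assumes a power-law relationship, error = a * hours^(-b), which is a common empirical assumption and not something this task has verified. The function names and the measurements are placeholders for illustration only, not real results.

```python
import math


def fit_power_law(hours, error_rates):
    """Fit error = a * hours**(-b) by least squares in log-log space."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(e) for e in error_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = intercept + slope * log(hours), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope


def predict_error(a, b, hours):
    """Extrapolate the fitted curve to a new amount of speaker data."""
    return a * hours ** (-b)


if __name__ == "__main__":
    # PLACEHOLDER values, not measured results: error rate on the speaker
    # test set after training on 1, 2, 4 and 8 hours of speaker data.
    subset_hours = [1.0, 2.0, 4.0, 8.0]
    measured_error = [0.40, 0.32, 0.256, 0.2048]

    a, b = fit_power_law(subset_hours, measured_error)
    for future_hours in (10.0, 20.0, 40.0):
        print(f"{future_hours:5.1f} h -> predicted error {predict_error(a, b, future_hours):.3f}")
```

Extrapolations of this kind are only rough guides. A curve fitted on a handful of points can be badly off far outside the measured range, so the prediction is best read as a trend rather than a target.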

Methods

Fine-tune the existing model using a mix of the speaker's training data and the base training data. Two approaches are considered:
1. Start from the checkpoint of the best-performing model and fine-tune it on the new speaker data, optionally with LoRA (low-rank adaptation).
2. Train from scratch on both the base training data and the new speaker data.

Completion Criteria

A model pipeline ready to be trained on new speaker data, together with the results of its performance on the new speaker test data.
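The task does not name the metric used to report performance. As a minimal sketch, assuming character error rate (CER) is the chosen metric, the comparison between the base model and the fine-tuned model on the speaker test set could be computed as follows. The reference and hypothesis strings are short ASCII placeholders, not real transcripts, and characters are counted as Unicode code points, which is itself a choice that may need revisiting for Tibetan text.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, one row at a time."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def corpus_cer(references, hypotheses):
    """Total character edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


if __name__ == "__main__":
    # Placeholder strings standing in for speaker test transcripts.
    references = ["abcd", "efgh"]
    base_model_output = ["abxd", "efg"]     # hypothetical base-model output
    tuned_model_output = ["abcd", "efgx"]   # hypothetical fine-tuned output

    print("base model CER :", corpus_cer(references, base_model_output))   # 0.25
    print("tuned model CER:", corpus_cer(references, tuned_model_output))  # 0.125
```

Summing edits over the whole test set before dividing keeps long utterances from being under-weighted, which averaging per-utterance CER would do.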

Implementation

Prepare the catalog of raw audio data with metadata.
Prepare a pipeline for uploading the new speaker's split audio into pecha tools for transcription.
Prepare 10 hours of training data.
Identify the approaches available for training the model on the new speaker's data.
Evaluate the performance of the existing model on the new speaker data.
Train the model on the new speaker data.
Evaluate the performance of the new model on the new speaker data.
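The catalog and data-preparation steps above could be sketched roughly as below. The column names (`segment_id`, `source_file`, `duration_sec`, `speaker`), the 10% test fraction, and the example rows are all hypothetical; the actual catalog schema is not specified in this task. The split is derived from a hash of the segment ID so that a segment stays in the same split every time the catalog is rebuilt.

```python
import csv
import hashlib
import io

# Hypothetical catalog columns; the real schema is not defined in this task.
CATALOG_FIELDS = ["segment_id", "source_file", "duration_sec", "speaker"]


def assign_split(segment_id, test_fraction=0.1):
    """Deterministically map a segment ID to 'train' or 'test'."""
    digest = hashlib.sha256(segment_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def build_catalog(rows, test_fraction=0.1):
    """Return the catalog as CSV text with an added 'split' column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CATALOG_FIELDS + ["split"])
    writer.writeheader()
    for row in rows:
        split = assign_split(row["segment_id"], test_fraction)
        writer.writerow({**row, "split": split})
    return buffer.getvalue()


def total_hours(rows):
    """Sum segment durations, e.g. to check progress toward the 10-hour goal."""
    return sum(float(row["duration_sec"]) for row in rows) / 3600.0


# Placeholder rows for illustration only.
rows = [
    {"segment_id": "seg_0001", "source_file": "talk_a.wav", "duration_sec": 7.5, "speaker": "new_speaker"},
    {"segment_id": "seg_0002", "source_file": "talk_a.wav", "duration_sec": 4.5, "speaker": "new_speaker"},
]

if __name__ == "__main__":
    print(build_catalog(rows))
    print(f"catalogued audio: {total_hours(rows):.4f} hours")
```

If segments cut from the same recording turn out to be very similar, hashing the `source_file` instead of the `segment_id` would keep whole recordings on one side of the split.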

Fine-Tuning Using Checkpoints of the Best Performing Model and Then Fine-Tuning on New Speaker Data

Approach: Start with the best-performing checkpoint of the base model (already trained on the base training data) and further fine-tune it exclusively on the new speaker-specific data.

Pros:
Efficient Use of Existing Knowledge: This approach leverages the pre-existing training done on the base data, allowing the model to retain general knowledge while adapting specifically to the speaker's style.
Faster Training Time: Since the model is already trained on the general data, it requires significantly fewer epochs to fine-tune on the speaker-specific data, saving computational time and resources.
Less Computationally Expensive: Using an already trained model as a starting point reduces the cost of training from scratch, making it more cost-effective, especially with large models.
Reduced Risk of Overfitting: Since only the last stage of training focuses on the speaker-specific data, the model retains general patterns learned from a larger, diverse dataset, which helps avoid overfitting to the speaker's data alone.
Parameter-Efficient Fine-Tuning Options: Techniques like adapters or LoRA can be applied directly to the checkpoint to make this step even more efficient.

Cons:
Limited Generalization to Other Speakers: Fine-tuning on the new speaker's data can make the model overly specialized, potentially degrading performance on data from other speakers if not done carefully.
Dependence on Initial Model Quality: The success of this method heavily relies on the quality of the initial checkpoint; if the initial model isn't robust, the final performance might not be optimal.
Risk of Catastrophic Forgetting: The model might lose some of its general knowledge from the base training data if fine-tuned too extensively on the new speaker data, leading to a trade-off between general and speaker-specific performance.
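To make the parameter-efficiency point concrete, the following is a small pure-Python illustration of the LoRA idea: the frozen weight W is left untouched and only a low-rank update (alpha / r) * B A is trained. The 1024 x 1024 layer size, rank 8, and the toy matrices are hypothetical values chosen for the arithmetic and are not taken from the project's model or configuration. A real run would rely on a fine-tuning library rather than this hand-written version.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA update of one layer."""
    full = d_in * d_out               # every entry of W is trainable
    lora = rank * (d_in + d_out)      # A is rank x d_in, B is d_out x rank
    return full, lora


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the frozen W."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    full, lora = lora_param_counts(d_in=1024, d_out=1024, rank=8)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")

    # Toy 2 x 3 layer with rank 1. B starts at zero, as in the usual LoRA
    # initialisation, so the adapted layer initially equals the checkpoint.
    w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    a = [[0.5, -0.5, 1.0]]   # rank x d_in
    b = [[0.0], [0.0]]       # d_out x rank
    print(lora_effective_weight(w, a, b, alpha=2.0, rank=1) == w)
```

Because the base weights stay frozen, the adapter can be removed or swapped to recover the original checkpoint's behaviour, which is one way to limit the forgetting risk listed above.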

Training from Scratch Using Both Base Training Data and New Speaker Data

Approach: Train the model from scratch by mixing the base training data with the new speaker-specific data. This is a full retraining approach where the model learns both types of data simultaneously.

Pros:
Balanced Learning: Training with a combined dataset allows the model to learn general patterns from the base data while adapting to the specific features of the new speaker's data, creating a more balanced model.
Better Generalization: By incorporating the new speaker data throughout the training process, the model can better generalize across both the speaker-specific and base data, maintaining broader applicability.
Improved Performance on Speaker Data: Integrating the new speaker data from the start of training helps the model adapt more naturally, leading to potentially better performance on the new speaker's test set.
Reduced Risk of Catastrophic Forgetting: Since the new data is part of the entire training process, there is less risk of the model forgetting the base knowledge, allowing it to retain a broader skill set.

Cons:
Higher Computational Cost: Training from scratch with the entire dataset requires significantly more time, resources, and computational power, making it a costly approach, especially with large models.
Longer Training Time: Full retraining is time-consuming, especially with large datasets, as all parameters are updated across both datasets rather than just adapting existing weights.
Under-Representation Risk with Imbalanced Data: If the speaker-specific data is significantly smaller than the base data, the model might still underperform on the speaker's test data because the base data dominates the training process.
Resource-Intensive: Requires more extensive hardware and may need multiple runs to find the best training configuration, impacting overall project timelines and budgets.
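One common way to soften the imbalance problem when mixing the two datasets is to oversample the speaker data so that it fills a chosen share of each training epoch. The sketch below shows the arithmetic under stated assumptions: the 1000-hour base dataset size and the 20% target share are hypothetical placeholders (this task does not state either), and only the 10 hours of speaker data comes from the plan above.

```python
import random


def mixing_weights(base_hours, speaker_hours, speaker_share):
    """Per-hour sampling weights so speaker audio fills `speaker_share` of samples."""
    if not 0.0 < speaker_share < 1.0:
        raise ValueError("speaker_share must be strictly between 0 and 1")
    base_weight = (1.0 - speaker_share) / base_hours
    speaker_weight = speaker_share / speaker_hours
    return base_weight, speaker_weight


def oversampling_factor(base_hours, speaker_hours, speaker_share):
    """How many times more often an hour of speaker audio is drawn than base audio."""
    base_weight, speaker_weight = mixing_weights(base_hours, speaker_hours, speaker_share)
    return speaker_weight / base_weight


def sample_sources(base_hours, speaker_hours, speaker_share, n, seed=0):
    """Draw n dataset labels according to the target mixing share."""
    base_weight, speaker_weight = mixing_weights(base_hours, speaker_hours, speaker_share)
    rng = random.Random(seed)
    return rng.choices(
        ["base", "speaker"],
        weights=[base_weight * base_hours, speaker_weight * speaker_hours],
        k=n,
    )


if __name__ == "__main__":
    BASE_HOURS = 1000.0     # hypothetical size of the base training data
    SPEAKER_HOURS = 10.0    # speaker training data planned in this task
    TARGET_SHARE = 0.2      # hypothetical share of speaker audio per epoch

    natural_share = SPEAKER_HOURS / (BASE_HOURS + SPEAKER_HOURS)
    factor = oversampling_factor(BASE_HOURS, SPEAKER_HOURS, TARGET_SHARE)
    print(f"share without reweighting: {natural_share:.2%}")
    print(f"oversampling factor for a {TARGET_SHARE:.0%} share: {factor:.1f}x")
```

A high oversampling factor means the same speaker utterances are repeated many times per epoch, which trades the under-representation problem for a higher risk of memorising the speaker set, so the target share is worth treating as a tunable setting.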

OpenPecha / stt-wav2vec2