On the test set of 105 samples (10% of the total data):
- base model: CER = 9.78%
- fine-tuned model (checkpoint 21000): CER = 7.93%
- fine-tuned model (checkpoint 17500): CER = 7.97%
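For reproducibility, here is a minimal sketch of how the CER comparison could be run with the `evaluate` library. The checkpoint ids, dataset repo id, and the "audio"/"text" column names are placeholders, not the actual artifacts linked in this issue.

```python
# Sketch: compare CER of the base and fine-tuned checkpoints on the held-out test set.
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

cer_metric = evaluate.load("cer")
device = "cuda" if torch.cuda.is_available() else "cpu"

def evaluate_cer(model_id, test_set):
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).to(device).eval()
    predictions, references = [], []
    for sample in test_set:
        inputs = processor(sample["audio"]["array"], sampling_rate=16_000,
                           return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        predictions.append(processor.batch_decode(pred_ids)[0])
        references.append(sample["text"])                 # "text" column is a placeholder name
    return cer_metric.compute(predictions=predictions, references=references)

test_set = load_dataset("user/situ-rinpoche-asr", split="test")   # placeholder repo id
test_set = test_set.cast_column("audio", Audio(sampling_rate=16_000))

for model_id in ["base-model-id", "finetuned-checkpoint-id"]:      # placeholder ids
    print(model_id, f"CER = {evaluate_cer(model_id, test_set):.2%}")
```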
HF links to the fine-tuned model and training data: model, training data
Training parameters:
```python
per_device_train_batch_size=8,   # smaller batch size to increase updates per epoch
gradient_accumulation_steps=1,   # no accumulation, so the effective batch size stays at 8
evaluation_strategy="steps",
save_steps=500,                  # save checkpoints frequently due to limited data
eval_steps=50,                   # evaluate regularly to monitor overfitting
logging_steps=50,
learning_rate=1e-6,              # lower learning rate for finer adjustment on small data
num_train_epochs=200,            # more epochs to learn fully from the limited data
save_total_limit=500,            # limit checkpoints to manage storage
fp16=True,                       # mixed precision for faster computation, if supported
warmup_steps=100,                # short warmup before the main schedule
report_to=['wandb'],             # optional: log to WandB for tracking
push_to_hub=False,
```
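A hedged sketch of how these parameters might be wired into a `Trainer` for CTC fine-tuning, following the standard Hugging Face wav2vec2 recipe (recent `transformers`/`datasets` versions assumed). The model id, dataset repo id, `output_dir`, and the "text" column name are placeholders; the padding collator and feature-encoder freezing are the usual recipe choices, not confirmed details of this run.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("base-model-id")          # placeholder

raw = load_dataset("user/situ-rinpoche-asr")                            # placeholder repo id
raw = raw.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    # Turn raw audio into input_values and the transcript into CTC label ids.
    batch["input_values"] = processor(
        batch["audio"]["array"], sampling_rate=16_000).input_values[0]
    batch["labels"] = processor(text=batch["text"]).input_ids           # "text" is a placeholder
    return batch

train_set = raw["train"].map(prepare, remove_columns=raw["train"].column_names)
test_set = raw["test"].map(prepare, remove_columns=raw["test"].column_names)

@dataclass
class DataCollatorCTCWithPadding:
    # Pads audio inputs and label sequences separately and masks label padding
    # with -100 so it is ignored by the CTC loss.
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=True, return_tensors="pt")
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100)
        return batch

cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    pred_ids = pred.predictions.argmax(axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    return {"cer": cer_metric.compute(predictions=pred_str, references=label_str)}

# TrainingArguments from the list above, plus the required output_dir.
training_args = TrainingArguments(
    output_dir="wav2vec2-situ-finetuned",                               # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    evaluation_strategy="steps",
    save_steps=500,
    eval_steps=50,
    logging_steps=50,
    learning_rate=1e-6,
    num_train_epochs=200,
    save_total_limit=500,
    fp16=True,
    warmup_steps=100,
    report_to=["wandb"],
    push_to_hub=False,
)

model = Wav2Vec2ForCTC.from_pretrained(
    "base-model-id",                                                    # placeholder
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()   # common choice when fine-tuning on small data

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(processor=processor),
    compute_metrics=compute_metrics,
    train_dataset=train_set,
    eval_dataset=test_set,
    tokenizer=processor.feature_extractor,
)
trainer.train()
```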
Vast.ai instance:
Description
We need to fine-tune a wav2vec2 model for a specific speaker's accent and compare its performance against the base model on test data from that speaker.
Completion Criteria
A model that can accurately transcribe Situ Rinpoche's audio recordings.
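As a usage illustration, here is a short sketch of transcribing a single recording with a fine-tuned checkpoint; the checkpoint id and audio path are placeholders.

```python
# Sketch: transcribe one audio clip with the fine-tuned model.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "finetuned-checkpoint-id"                            # placeholder checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

speech, _ = librosa.load("situ_rinpoche_clip.wav", sr=16_000)   # placeholder path
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```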
Implementation
subtask