STT0043: Training STT wav2vec model on GCP.

gangagyatso4364 commented 4 months ago

Description

We need to train the stt-wav2vec2 model on the new datasets we have collected, which now include data from the newly introduced departments.

Completion Criteria

An STT wav2vec2 model with better performance that is ready to be deployed.

Implementation

Subtask

[x] get the combined new dataset from stt pecha tools and prodigy
[x] create a benchmark of stt data from combined dataset for all the departments
[x] create training and validation split from the combined data, excluding the benchmark data.
[x] upload to aws s3 bucket monlam.ai.stt/tsv/ training, validation, benchmark.
[x] prepare the dataset from training, validation, and benchmark dataset.
[x] get the vastai instance with required GPU.
[ ] train the model on prepared dataset.
[ ] evaluate the model's performance.
[ ] get the progress of model and save checkpoints
[ ] save the model to hugging face.
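
The split step in the subtasks above (hold out the benchmark, then divide the rest into training and validation) could look roughly like the sketch below. It is only an illustration: the `file_name` key, the 10% validation fraction, and the helper name `split_excluding_benchmark` are assumptions, not the actual schema or settings used for the monlam.ai.stt TSV files.

```python
import random


def split_excluding_benchmark(rows, benchmark_ids, val_fraction=0.1, seed=42):
    """Drop benchmark rows, then split the rest into (train, validation).

    `rows` is a list of dicts with a hypothetical 'file_name' key that
    uniquely identifies each audio segment.
    """
    # Benchmark rows must never leak into training or validation.
    remaining = [row for row in rows if row["file_name"] not in benchmark_ids]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(remaining)

    n_val = round(len(remaining) * val_fraction)
    return remaining[n_val:], remaining[:n_val]


# Toy data standing in for the combined dataset (assumed shape).
rows = [{"file_name": f"seg_{i:04d}", "uni": "..."} for i in range(100)]
benchmark_ids = {f"seg_{i:04d}" for i in range(10)}

train_rows, val_rows = split_excluding_benchmark(rows, benchmark_ids)
print(len(train_rows), len(val_rows))
```

Fixing the seed keeps the split stable across reruns, which matters if the training and validation TSVs are regenerated after new data arrives.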

Estimations

  Parameters:
  Notebook instance (SageMaker): ml.g5.8xlarge (24 GB GPU RAM, NVIDIA A10G)
  Batch Size: 8
  Gradient Accumulation Steps: 2
  Number of Training Examples: 1,071,157

Estimation of Time for training

Preprocessing Time:

      For 10,000 examples: 6.40 minutes
      For 100,000 examples: ~64 to 67 minutes
      For 1,071,157 examples: ~684.34 minutes ≈11.41 hours

Training Time:

      For 10,000 examples: 6.10 hours 
      For 100,000 examples: ~61.00 hours
      For 1,071,157 examples: ~653.40 hours

Total Time Required:

      Total Preprocessing Time ≈ 11.41 hours
      Total Training Time ≈ 653.40 hours
      Total Estimated Time = 11.41 hours + 653.40 hours ≈ 664.81 hours

Cost Estimation:

     Total Hours:   664.81 hours
     Hourly Cost:   $3 per hour

     Total Cost:
     664.81 hours × $3/hour = $1,994.43

Summary

     Total Estimated Cost for Training: $1,994.43
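
The estimate above is a linear extrapolation from the 10,000-example measurements, and can be reproduced with a short sketch like the one below. It assumes the 10,000-example preprocessing figure is 6.40 minutes (which is what the 100,000-example and full-dataset figures imply) and that time scales linearly with the number of examples. The function and field names are made up for illustration. Results differ from the figures above by a few hundredths because of rounding.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    preprocessing_hours: float
    training_hours: float
    total_hours: float
    total_cost_usd: float


def estimate_run(num_examples,
                 preprocess_min_per_10k=6.40,   # measured on 10,000 examples (assumed minutes)
                 train_hours_per_10k=6.10,      # measured on 10,000 examples
                 hourly_cost_usd=3.0):          # instance price per hour
    """Linearly scale the 10k-example measurements to `num_examples`."""
    scale = num_examples / 10_000
    preprocessing_hours = preprocess_min_per_10k * scale / 60
    training_hours = train_hours_per_10k * scale
    total_hours = preprocessing_hours + training_hours
    return Estimate(
        preprocessing_hours=preprocessing_hours,
        training_hours=training_hours,
        total_hours=total_hours,
        total_cost_usd=total_hours * hourly_cost_usd,
    )


estimate = estimate_run(1_071_157)
print(f"preprocessing: {estimate.preprocessing_hours:.2f} h")
print(f"training:      {estimate.training_hours:.2f} h")
print(f"total:         {estimate.total_hours:.2f} h")
print(f"cost:          ${estimate.total_cost_usd:,.2f}")
```

Changing `hourly_cost_usd` or the per-10k rates makes it easy to compare other instances, for example a vastai machine timed on a smaller subset.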

kaldan007 commented 4 months ago

will be working on data validation before running prepare data, and on exception handling.

gangagyatso4364 commented 4 months ago

configure setup to see the progress of model training in wandb.

gangagyatso4364 commented 4 months ago

saving the processor while doing prepare dataset.

gangagyatso4364 commented 4 months ago

using the AutoProcessor class to reload the existing processor

gangagyatso4364 commented 4 months ago

set up the training of model in vastai

gangagyatso4364 commented 3 months ago

still facing the same issue in vastai. having discussion with yash to resolve the issue.

ta4tsering commented 3 months ago

will work on limited training data to learn the ropes

gangagyatso4364 commented 3 months ago

make an estimate of training time and cost on a vastai instance using a smaller subset of data, and choose the best option by comparing pros and cons.

gangagyatso4364 commented 1 month ago

we will train the wav2vec2 model on our latest training data. Yash's model cannot output tsek and shed, so we will train our own model so that tsek and shed are visible in our inference output.

OpenPecha / stt-wav2vec2