OpenPecha / stt-wav2vec2

MIT License

STT0005: Train MMS 300m model (5) #1

Open spsither opened 5 months ago

spsither commented 5 months ago

Description

Train facebook/wav2vec2-xls-r-300m, since it was pretrained on an order of magnitude more audio data than facebook/wav2vec2-large-xlsr-53, which we have been using so far.

Completion Criteria

Post the model to the OpenPecha organization on HuggingFace and measure the CER on the benchmark dataset.
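In practice the CER would be computed with a library such as `jiwer` or `evaluate`; as an illustration only (not this project's actual evaluation code), character error rate is character-level Levenshtein distance normalized by reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Classic dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

# One substitution over five reference characters -> 0.2
print(cer("abcde", "abxde"))
```

The corpus-level CER reported in this thread would aggregate edit distances and reference lengths over the whole benchmark set rather than averaging per-utterance rates.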


Implementation Plan

1. Run `prepare_dataset` in batches and combine the resulting datasets.
2. Update the training script and launch the training job.
3. Resume training from the latest checkpoint if the machine fails for any reason.
4. Evaluate the model afterward.
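A minimal sketch of the batched preparation step. The helper names below are hypothetical stand-ins, not this repo's actual code; with HF `datasets`, the per-batch results would be combined with `datasets.concatenate_datasets`:

```python
def chunked(items, batch_size):
    """Split a list of audio files into fixed-size batches."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def prepare_dataset(batch):
    # Hypothetical stand-in for the real feature-extraction step,
    # which would load audio and compute wav2vec2 input values.
    return [f"features({path})" for path in batch]

audio_files = [f"clip_{i}.wav" for i in range(10)]

# Prepare each batch separately, then combine the shards.
# With HF datasets, combining would be concatenate_datasets([...]).
shards = [prepare_dataset(batch) for batch in chunked(audio_files, 4)]
combined = [example for shard in shards for example in shard]
print(len(combined))  # 10 examples, same as the input list
```

For the resume-after-failure step, the HF `Trainer` supports `trainer.train(resume_from_checkpoint=True)` to pick up from the latest saved checkpoint.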

Subtasks

spsither commented 5 months ago

Pushed spsither/mms_300_v1.630 to HF with a CER of 20.50%.

spsither commented 5 months ago

Pushed spsither/mms_300_v1.780 to HF. This beats the benchmark with a CER of 20.29%.

spsither commented 5 months ago

wav2vec2 and BERT use the same Transformer encoder configurations at each size: wav2vec2 base matches BERT base, and wav2vec2 large matches BERT large in parameter count.

By the same logic, wav2vec2 large and MMS-300m have the same number of parameters:

```
# mms_300
model.num_parameters()  # 315548395

# wav2vec2_run10
model.num_parameters()  # 315548395
```
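`num_parameters()` simply sums the element counts of every parameter tensor. A dependency-free sketch of that computation over a state-dict-like mapping of shapes (the layer names and shapes below are illustrative, not the real MMS-300m architecture):

```python
from math import prod

def num_parameters(shapes):
    """Sum of element counts across all parameter tensors."""
    return sum(prod(shape) for shape in shapes.values())

# Illustrative shapes only; the real checkpoint totals ~315.5M parameters.
toy_model = {
    "encoder.layer0.attention.weight": (1024, 1024),
    "encoder.layer0.attention.bias": (1024,),
    "lm_head.weight": (1024, 64),
}
print(num_parameters(toy_model))  # 1048576 + 1024 + 65536 = 1115136
```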

spsither commented 5 months ago

Pushed openpecha/tibetan_asr_mms300_v1 to HF

Started a new run with 771.30 hours of data

spsither commented 5 months ago

Evaluating the model at step 1190000

spsither commented 5 months ago

The new model pushed to HF has a CER of 20.26%.