NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Fine-tuning QuartzNet15x5 with Persian long-duration audio #2200

Closed MelikaBahmanabadi closed 3 years ago

MelikaBahmanabadi commented 3 years ago

Hi, I fine-tuned QuartzNet15x5 on the Persian Common Voice dataset and reached a WER of 30%. Everything worked well and I saved the model as a checkpoint. After a while, I restored the checkpoint and fine-tuned it on another Persian dataset whose clips are much longer than 16.7 seconds (i.e., 2 minutes, 3 minutes, and so on). When I test the model on this long-duration audio, the output is a jumble of vocabulary and the model cannot transcribe any audio correctly. What's your idea for fine-tuning QuartzNet15x5 on long-duration audio? Thanks
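For reference, the restore-and-fine-tune flow I describe looks roughly like this (a minimal sketch: the paths, batch size, and trainer settings are hypothetical, and the manifest is a standard NeMo JSON-lines file):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Restore the checkpoint saved after the Common Voice run (path is hypothetical).
model = nemo_asr.models.EncDecCTCModel.restore_from("persian_quartznet.nemo")

# Point the model at the new dataset via a NeMo JSON-lines manifest.
model.setup_training_data(train_data_config=OmegaConf.create({
    "manifest_filepath": "persian_long_train.json",  # hypothetical manifest
    "sample_rate": 16000,
    "labels": model.decoder.vocabulary,  # keep the existing Persian vocabulary
    "batch_size": 8,
    "shuffle": True,
}))

# PyTorch Lightning 1.x-style trainer arguments.
trainer = pl.Trainer(gpus=1, max_epochs=100)
trainer.fit(model)
```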

titu1994 commented 3 years ago

You generally should not train with such long sequences, since it would take far too much memory, training data, and compute to fit a model on long segments. Segments of 15-20 seconds are typically used for training.

It would be best to segment the audio in the second dataset. If that is not possible, combine the shorter- and longer-duration datasets and train for a long time (at least a hundred epochs or more) so the model fits the long-duration audio corpus properly.
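If you go the combined-dataset route, merging the two manifests is just concatenating their JSON-lines files. A minimal sketch (filenames are hypothetical; each manifest line is one JSON object with `audio_filepath`, `duration` in seconds, and `text` keys):

```python
import json

# Hypothetical manifest filenames for the two Persian datasets.
sources = ["persian_cv_train.json", "persian_long_train.json"]
long_clips = 0

with open("combined_train.json", "w", encoding="utf-8") as out:
    for path in sources:
        with open(path, encoding="utf-8") as src:
            for line in src:
                entry = json.loads(line)
                if entry["duration"] > 20.0:
                    long_clips += 1  # candidates for segmentation
                out.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"{long_clips} clips exceed the typical 15-20 s training window")
```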

Also, note that you can take a QuartzNet model trained on 16.7-second audio and evaluate it on 2-5 minute clips with no issue. There may be a modest increase in word error rate without fine-tuning.
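A minimal evaluation sketch for such long clips, assuming the NeMo 1.x-style `transcribe()` API (paths are hypothetical):

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModel.restore_from("persian_quartznet.nemo")
model.eval()

# batch_size=1 keeps GPU memory bounded on minutes-long recordings.
transcripts = model.transcribe(paths2audio_files=["long_clip_2min.wav"], batch_size=1)
print(transcripts[0])
```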

MelikaBahmanabadi commented 3 years ago

Thanks