NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Fine-Tuning Fast Conformer CTC on Other Language #10112

duckyngo closed this issue 3 weeks ago

duckyngo commented 1 month ago

First of all, thank you for creating and maintaining this incredible framework. It has been a valuable tool for our work.

I am currently attempting to fine-tune the Fast Conformer CTC model on a Korean language dataset (1000 hours), using the pretrained English model as the starting point.

Issue:

After 21 epochs of training, the validation WER remains at 1, and the learning rate does not seem to decrease.

I would greatly appreciate any guidance on what might be going wrong during the training process.
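A validation WER pinned at exactly 1 usually means the decoder is producing empty (all-blank) hypotheses, so every reference word counts as a deletion. A minimal, standalone sketch of the word-error-rate arithmetic (not NeMo's implementation) shows why an empty output yields WER = 1.0:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A CTC model that collapses to all blanks emits an empty hypothesis:
print(wer("안녕하세요 반갑습니다", ""))  # every word deleted -> 1.0
```

If the WER sits at 1 for many epochs, it is worth decoding a few validation samples directly to confirm whether the model is emitting blanks rather than wrong-but-nonempty text.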

Additional Information: I attach a screenshot from W&B here for reference.

Environment details

Additional context: We have access to a 20,000-hour dataset, but since training on the full dataset would be very time-consuming, we decided to start with 1,000 hours to see whether the model can converge before scaling up.

nithinraok commented 3 weeks ago

Could you also plot the lr graph? Your validation loss kept increasing. Could you start with 1024 tokens? Also, you mentioned the same tokens work well for Conformer: does that mean you tried the same setup with Conformer and it trained well, and you are only seeing issues with FastConformer?

If possible share complete config.

duckyngo commented 3 weeks ago

Thank you for your support!

I managed to resolve the issue, and I wanted to share the solution in case others encounter a similar problem. The root cause was related to the batch size and learning rate. Since my batch size was relatively small, I found it necessary to reduce the learning rate accordingly. The default configuration’s learning rate parameters are optimized for a global batch size of 2K, so using a smaller batch size requires a lower learning rate.
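The adjustment described above is essentially the linear scaling heuristic: keep the learning rate roughly proportional to the global batch size. A rough sketch, where the reference batch of 2048 and the peak lr of 1e-3 are illustrative assumptions (not the actual NeMo config values):

```python
# Linear scaling heuristic: lr should shrink in proportion to batch size.
# Reference values below are illustrative assumptions, not NeMo defaults.
REF_GLOBAL_BATCH = 2048  # batch size the default schedule was tuned for
REF_PEAK_LR = 1e-3       # peak learning rate paired with that batch size

def scaled_lr(global_batch: int) -> float:
    """Peak lr scaled linearly with the actual global batch size."""
    return REF_PEAK_LR * global_batch / REF_GLOBAL_BATCH

# Global batch = micro batch * gradient accumulation * number of GPUs.
print(scaled_lr(2048))  # unchanged at the reference batch: 0.001
print(scaled_lr(256))   # an 8x smaller batch -> 8x smaller peak lr
```

This is a heuristic starting point, not a guarantee; the issue author's empirical tuning (lowering lr until training stabilized) is still the deciding test.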

Initially, the model converged well during the early stages when the learning rate was low. However, as the learning rate increased due to the warm-up settings, the training became unstable. By further reducing the learning rate, I was able to stabilize the training, and the model began converging as expected.
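The pattern described, stable early training followed by divergence, matches how warm-up schedules behave: the lr ramps up for `warmup_steps` before decaying, so the first epochs run at a tiny lr and the instability only appears near the peak. A sketch of a Noam-style schedule (the `d_model` and `warmup_steps` values here are illustrative, not the config's actual settings):

```python
def noam_lr(step: int, d_model: int = 512,
            warmup_steps: int = 10_000, scale: float = 1.0) -> float:
    """Noam-style schedule: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)

# lr climbs during warm-up, peaks at warmup_steps, then decays:
for s in (100, 1_000, 10_000, 40_000):
    print(s, noam_lr(s))
```

Because the peak lr is reached only after warm-up, a run can look healthy for thousands of steps and still diverge later; lowering the overall `scale` (or the configured peak lr) caps the maximum of this curve, which is consistent with the fix reported above.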


I hope this information helps others who might be facing similar challenges with smaller batch sizes. Thanks again for your support!