NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Replicate quartznet results #3035

Closed: Miamoto closed this issue 2 years ago

Miamoto commented 2 years ago

Hello!

I am training QuartzNet 5x5 to do some experiments of my own once I have the baseline.

I want to first replicate results on WSJ. What hyperparameters do you recommend for training with a small number of GPUs? Right now I am training with 3 GPUs, a learning rate of 0.05, batch size 32 with DDP in PyTorch Lightning, fp16 precision, and 1000 epochs. The other hyperparameters are similar to the bigger QuartzNet models shown in the paper. Also, is there a recommended number of iterations to train for each batch?

I would like to know how fast I can converge to 11-12% greedy WER on the WSJ test set. I have already reached around 24% WER on the validation set using only 100 epochs, with 3 GPUs and a learning rate of 0.01.

Thanks! Carlos

titu1994 commented 2 years ago

We haven't experimented much with QuartzNet 5x5, but you could give Citrinet 256 or Conformer Small a try on WSJ with good results.

I don't think QuartzNet can withstand a 0.05 LR on a small number of GPUs. You could try 0.02. I can give better settings for Citrinet or Conformer, as we have experimented more with those models on small datasets.
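For reference, here is a minimal sketch of how those settings could be applied to the stock QuartzNet config from Python. The config path and the `gpus` key are assumptions based on the standard quartznet_15x5.yaml layout and may differ between NeMo and PyTorch Lightning versions, and the dataset manifest paths still need to be filled in.

```python
# Sketch only: adjust the stock QuartzNet config for 3 GPUs with a lower LR.
# Config path and key names follow the standard quartznet_15x5.yaml layout
# and may vary between NeMo / PyTorch Lightning versions.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

cfg = OmegaConf.load("quartznet_15x5.yaml")  # hypothetical local copy of the config

cfg.model.optim.lr = 0.02            # suggested starting point instead of 0.05
cfg.model.train_ds.batch_size = 32
cfg.trainer.precision = 16
cfg.trainer.max_epochs = 1000
cfg.trainer.gpus = 3                 # newer PyTorch Lightning versions use `devices`
# DDP is configured by the stock trainer section of the config / PyTorch Lightning.

trainer = pl.Trainer(**cfg.trainer)
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)
```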

If you want even better scores, you can train a small ContextNet Transducer or Conformer Transducer. They generally get very good results and converge faster than CTC models.

Miamoto commented 2 years ago

Thanks for the answer! My goal is to do experiments with architectures that do not have explicit memory in them, the way Conformers do. Okay, so for Citrinet 256, how can I get there quickly with 3 GPUs? Does Citrinet 256 get good results on WSJ? Thank you so much! 👍

titu1994 commented 2 years ago

Smaller models will do modestly on most datasets, but they take longer to converge and have worse overall scores compared to larger models. If strong scores are all that's needed, I would suggest training larger models if memory allows. The Citrinet config is available, as is the paper with hyperparameters, and there are pretrained checkpoints you can use to obtain the config definitions.

You can extract a config by using cfg = Model.from_pretrained(name, return_config=True) and save it with OmegaConf.save(cfg, path)
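Putting that together, a short sketch of the extraction step (the model name "QuartzNet15x5Base-En" and the output filename here are just examples; substitute the checkpoint you actually want):

```python
# Sketch: pull the config out of a pretrained checkpoint and save it to YAML.
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# return_config=True downloads the checkpoint metadata and returns only its config.
cfg = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En", return_config=True
)
OmegaConf.save(config=cfg, f="quartznet_15x5_base_en.yaml")
```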