NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speed and convergence of QuartzNet #1754

Closed pe-trik closed 3 years ago

pe-trik commented 3 years ago

Hi,

first of all, thanks for the great library!

I've been using this library for more than a year. I train mostly QuartzNet models for English and Czech. Since the new version (>1.0.*), I have experienced an approximately 3x slowdown compared to older versions (<v0.11.0) on my older GPUs (GTX 1080Ti).

I train on eight to ten GPUs, with mixed precision and a batch size of 32.

Previously, one training step took approx. 3 s; with the new version it takes approx. 15 s. When I watch the nvidia-smi tool, I observe strange behaviour: GPU utilization stays at 100% most of the time, but power usage alternates between roughly 80 W and roughly 150 W (out of 250 W). This is unlike my newer GPUs (Quadro RTX 5000), where the utilization fluctuates a lot (though it still sits around 100%) and power usage is above 200 W (out of 230 W) most of the time.

Also, when I scale the training down to fewer GPUs, I struggle to get the model to converge. Previously, I just adjusted accumulate_grad_batches accordingly and it worked (see the sketch below).
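For concreteness, a minimal sketch (plain Python, not NeMo code) of the relationship I rely on when trading GPUs for gradient accumulation:

```python
# Effective (global) batch size as the optimizer sees it:
# per-GPU batch size * number of GPUs * accumulate_grad_batches.
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int) -> int:
    return per_gpu_batch * num_gpus * grad_accum

# Halving the GPU count while doubling accumulation keeps the effective batch the same:
assert effective_batch_size(32, 8, 1) == effective_batch_size(32, 4, 2) == 256
```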

I suspect this behaviour stems from the mixed-precision backend. In previous versions I used Apex; with the current version this no longer seems possible (it states that Apex is not supported).

Might the slowdown be caused by not using Apex? Is it somehow possible to use it with the current version? Could you please specify the exact parameters to replicate QuartzNet training on LibriSpeech on one or more GPUs with mixed precision?

Thanks, Peter

Environment details

Additional context

CUDA 10.2, driver 440.33.01 (on both the new and the old GPUs)

titu1994 commented 3 years ago

Could you share your training config file for QuartzNet? The PyTorch Lightning defaults log training metrics at every step, which is excessive and can be a cause of such a significant drop in training speed.
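As a hedged illustration (plain PyTorch Lightning outside NeMo, not the exact NeMo setup), the logging frequency is just a trainer argument:

```python
import pytorch_lightning as pl

# Logging every step adds logger I/O to every iteration; raising
# log_every_n_steps (e.g. to 100) removes most of that overhead.
trainer = pl.Trainer(log_every_n_steps=100)
```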

okuchaiev commented 3 years ago

I am also wondering about the num_workers parameter for the data layers.

pe-trik commented 3 years ago

quartznet.yaml.txt

And I start the training with these parameters:

hydra.run.dir="." \
trainer.gpus=4 \
+trainer.max_steps=200000 \
trainer.log_every_n_steps=100 \
model.train_ds.batch_size=32 \
+trainer.precision=16 \
+trainer.amp_level=O1  \
trainer.accumulate_grad_batches=2 \
trainer.val_check_interval=2000 \
+model.validation_ds.num_workers=1  \
+model.train_ds.num_workers=8 \
+model.train_ds.pin_memory=True \
model.optim.lr=0.01

Regarding num_workers: I have had good experience with two workers per GPU.
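As a rough illustration of what the parameter does (plain PyTorch DataLoader rather than NeMo's data layer, with a dummy dataset as a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for the real audio dataset (placeholder only).
dataset = TensorDataset(torch.randn(1024, 64))

# num_workers sets how many CPU subprocesses prefetch batches; too few
# workers can leave the GPU waiting for data between steps.
loader = DataLoader(dataset, batch_size=32, num_workers=8, pin_memory=True)
```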

titu1994 commented 3 years ago

The config and overrides seem fine. I see you are using amp_level=O1; please note that since PyTorch 1.6, the recommended AMP implementation is PyTorch native AMP. I doubt that's the cause of any slowdown, though.
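A hedged sketch of the difference (plain PyTorch Lightning, outside NeMo; assumes a GPU is available, since 16-bit AMP is a GPU feature):

```python
import pytorch_lightning as pl

# Native AMP: precision=16 alone selects PyTorch's torch.cuda.amp path;
# no amp_level is needed.
native = pl.Trainer(gpus=1, precision=16)

# The Apex O1 path would instead be requested explicitly:
# apex = pl.Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O1")
```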