NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speed and convergence of QuartzNet #1754

Closed pe-trik closed 3 years ago

pe-trik commented 3 years ago

Hi,

first of all, thanks for the great library!

I've been using this library for more than a year. I train mostly QuartzNet models for English and Czech. Since the new version (>1.0.*), I have experienced an approximately 3x slowdown compared to older versions (<v0.11.0) on my older GPUs (GTX 1080Ti).

I train on eight to ten GPUs, with mixed precision and a batch size of 32.

Previously, one training step took approx. 3 s; with the new version it takes approx. 15 s. When I watch the nvidia-smi tool, I observe strange behaviour: GPU utilization stays at 100% most of the time, but power usage alternates between roughly 80 W and roughly 150 W (out of 250 W). This is unlike my newer GPUs (Quadro RTX 5000), where the utilization fluctuates a lot (though it still sits around 100%) and power usage is above 200 W (out of 230 W) most of the time.

Also, when I scale the training down to fewer GPUs, I struggle to get the model to converge. Previously, I just adjusted accumulate_grad_batches accordingly and it worked (see the sketch below).
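For concreteness, a minimal sketch (plain Python, not NeMo code) of the relationship I rely on when trading GPUs for gradient accumulation:

```python
# Effective (global) batch size as the optimizer sees it:
# per-GPU batch size * number of GPUs * accumulate_grad_batches.
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int) -> int:
    return per_gpu_batch * num_gpus * grad_accum

# Halving the GPU count while doubling accumulation keeps the effective batch the same:
assert effective_batch_size(32, 8, 1) == effective_batch_size(32, 4, 2) == 256
```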

I suspect this behaviour stems from the mixed-precision backend. In previous versions I used Apex; with the current version this no longer seems possible (it states that Apex is not supported).

Might the slowdown be caused by not using Apex? Is it somehow possible to use it with the current version? Could you please specify the exact parameters to replicate QuartzNet training on LibriSpeech on one or more GPUs with mixed precision?

Thanks, Peter

Environment details

Additional context

CUDA 10.2, driver 440.33.01 (on both the new and the old GPUs)

titu1994 commented 3 years ago

Could you share your training config file for QuartzNet? The PyTorch Lightning defaults log training metrics at every step, which is excessive and can be a cause of such a significant drop in training speed.
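As a hedged illustration (plain PyTorch Lightning outside NeMo, not the exact NeMo setup), the logging frequency is just a trainer argument:

```python
import pytorch_lightning as pl

# Logging every step adds logger I/O to every iteration; raising
# log_every_n_steps (e.g. to 100) removes most of that overhead.
trainer = pl.Trainer(log_every_n_steps=100)
```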

okuchaiev commented 3 years ago

I am also wondering about the num_workers parameter for the data layers.

pe-trik commented 3 years ago

quartznet.yaml.txt

And I start the training with these parameters:

hydra.run.dir="." \
trainer.gpus=4 \
+trainer.max_steps=200000 \
trainer.log_every_n_steps=100 \
model.train_ds.batch_size=32 \
+trainer.precision=16 \
+trainer.amp_level=O1  \
trainer.accumulate_grad_batches=2 \
trainer.val_check_interval=2000 \
+model.validation_ds.num_workers=1  \
+model.train_ds.num_workers=8 \
+model.train_ds.pin_memory=True \
model.optim.lr=0.01

Regarding num_workers: I have had good experience with two workers per GPU.
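As a rough illustration of what the parameter does (plain PyTorch DataLoader rather than NeMo's data layer, with a dummy dataset as a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for the real audio dataset (placeholder only).
dataset = TensorDataset(torch.randn(1024, 64))

# num_workers sets how many CPU subprocesses prefetch batches; too few
# workers can leave the GPU waiting for data between steps.
loader = DataLoader(dataset, batch_size=32, num_workers=8, pin_memory=True)
```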

titu1994 commented 3 years ago

The config and overrides seem fine. I see you are using amp_level=O1; please note that since PyTorch 1.6, the recommended AMP implementation is PyTorch native AMP. I doubt that's the cause of any slowdown, though.
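A hedged sketch of the difference (plain PyTorch Lightning, outside NeMo; assumes a GPU is available, since 16-bit AMP is a GPU feature):

```python
import pytorch_lightning as pl

# Native AMP: precision=16 alone selects PyTorch's torch.cuda.amp path;
# no amp_level is needed.
native = pl.Trainer(gpus=1, precision=16)

# The Apex O1 path would instead be requested explicitly:
# apex = pl.Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O1")
```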