pe-trik closed this issue 3 years ago.
Could you share your training config file for QuartzNet? By default, pytorch lightning logs at every training step, which is excessive and can cause such a significant drop in training speed.
I am also wondering about the num_workers parameter for the data layers.
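For reference, a minimal sketch of the logging knob I mean, assuming a pytorch-lightning release where Trainer exposes log_every_n_steps:

```python
from pytorch_lightning import Trainer

# Logging every step forces frequent host-side work (metric reduction,
# logger writes) that can stall the GPUs; raising the interval keeps
# the training loop on the device.
trainer = Trainer(
    gpus=4,
    precision=16,           # native AMP on PyTorch >= 1.6
    log_every_n_steps=100,  # instead of logging every step
)
```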
And I start the training with these parameters:
hydra.run.dir="." \
trainer.gpus=4 \
+trainer.max_steps=200000 \
trainer.log_every_n_steps=100 \
model.train_ds.batch_size=32 \
+trainer.precision=16 \
+trainer.amp_level=O1 \
trainer.accumulate_grad_batches=2 \
trainer.val_check_interval=2000 \
+model.validation_ds.num_workers=1 \
+model.train_ds.num_workers=8 \
+model.train_ds.pin_memory=True \
model.optim.lr=0.01
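In case it matters: the +-prefixed entries add keys that the base config does not define, while the plain ones override existing keys. A minimal OmegaConf sketch of the override mechanics (the config below is made up; the + handling itself is Hydra's, not OmegaConf's):

```python
from omegaconf import OmegaConf

# Made-up base config, standing in for the real QuartzNet YAML.
cfg = OmegaConf.create({"trainer": {"gpus": 1}, "model": {"optim": {"lr": 0.05}}})

# Plain dotlist overrides replace existing keys, just like on the command line.
cfg.merge_with_dotlist(["trainer.gpus=4", "model.optim.lr=0.01"])
print(OmegaConf.to_yaml(cfg))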
Regarding num_workers: I have had good experience with two workers per GPU.
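In plain PyTorch terms (the dataset below is just a stand-in to make the example runnable): under DDP every GPU gets its own process and its own DataLoader, so num_workers is counted per GPU, not in total.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in NeMo this comes from model.train_ds instead.
dataset = TensorDataset(torch.randn(1024, 64))

# Under DDP each GPU runs its own process with its own DataLoader, so
# "two workers per GPU" means num_workers=2 here, not 2 * num_gpus.
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)
```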
The config and overrides seem fine. I see you are using amp_level=O1; please note that since PyTorch 1.6, the recommended AMP implementation is PyTorch native AMP. I doubt that's the cause of any slowdown, though.
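For reference, a minimal sketch of the native path, assuming PyTorch >= 1.6 and a CUDA device (in Lightning, precision=16 alone selects it; amp_level only applies to the apex backend):

```python
import torch

model = torch.nn.Linear(64, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 64, device="cuda")
with torch.cuda.amp.autocast():  # runs ops in fp16 where safe, fp32 elsewhere
    loss = model(x).sum()
scaler.scale(loss).backward()    # scale up to avoid fp16 gradient underflow
scaler.step(optimizer)           # unscales gradients before the optimizer step
scaler.update()
```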
Hi,
first of all, thanks for the great library!
I've been using this library for more than a year; I train mostly QuartzNet models for English and Czech. Since the new version (>1.0.*) I have experienced a ~3x slowdown compared to older versions (<v0.11.0) on my older GPUs (GTX 1080Ti).
I train on eight to ten GPUs, with mixed precision and a batch size of 32.
Previously, one training step took approx. 3 s; with the new version it takes approx. 15 s. When I open the nvidia-smi tool, I observe strange behaviour: most of the time GPU utilization is 100%, but the power usage alternates between roughly 80 W and 150 W (out of 250 W). This is unlike my newer GPUs (Quadro RTX 5000), where the utilization fluctuates a lot (but still averages about 100%) and the power usage is above 200 W (out of 230 W) most of the time.
Also, when I scale the training down to fewer GPUs, I struggle to get the model to converge. Previously, I just adjusted accumulate_grad_batches accordingly (the arithmetic is sketched below) and it worked.
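Concretely, the bookkeeping I relied on (the numbers are from my setup, nothing NeMo-specific):

```python
# Effective (global) batch size under DDP with gradient accumulation:
#   global_batch = num_gpus * per_gpu_batch * accumulate_grad_batches
# Going from 8 GPUs to 4 while keeping 512 samples per optimizer step:
assert 8 * 32 * 2 == 4 * 32 * 4 == 512
```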
I suspect this behaviour stems from the backend. In previous versions, I used Apex; with the current version, that no longer seems possible (it reports that Apex is not supported).
Might the slowdown be caused by not using Apex? Is it somehow possible to use it with the current version? Can you please specify the exact parameters to replicate the QuartzNet LibriSpeech training on one and on multiple GPUs with mixed precision?
Thanks, Peter
Environment details
CUDA 10.2; driver 440.33.01 (same on both the new and the old GPUs)