NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Distributed training via Horovod does not make learning faster #4229

Closed EmreOzkose closed 2 years ago

EmreOzkose commented 2 years ago

Describe the bug I ran the text classification with BERT example. Training went well: I got approximately 90% accuracy after the first epoch. Then I tried to run the same example with Horovod. My changes are written below:

Change line 116: trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **cfg.trainer) to trainer = pl.Trainer(strategy="horovod", **cfg.trainer).
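For reference, a minimal sketch of that change (my paraphrase, not the exact script): it assumes `cfg` is the example's Hydra config, the NeMo 1.x import path for NLPDDPPlugin, and that the Horovod branch is only used when the script is launched via horovodrun with Horovod installed.

```python
# Sketch of the two trainer variants discussed above.
from omegaconf import OmegaConf
import pytorch_lightning as pl
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPPlugin  # NeMo 1.x path

# Stand-in for the example's Hydra config (real values come from the YAML file).
cfg = OmegaConf.create({"trainer": {"accelerator": "gpu", "devices": 1, "max_epochs": 1}})

USE_HOROVOD = False  # flip to True only when launching via `horovodrun -np N ...`

if USE_HOROVOD:
    # Horovod drives data parallelism; per the PTL docs, each process typically
    # keeps devices=1 and the worker count comes from `horovodrun -np N`.
    trainer = pl.Trainer(strategy="horovod", **cfg.trainer)
else:
    # Original line 116: NeMo's DDP plugin; parallelism is set by trainer.devices.
    trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **cfg.trainer)
```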

Change config file:

  model.train_ds.file_path = path/to/train
  model.validation_ds.file_path = path/to/test
  model.train_ds.batch_size = 8
  model.validation_ds.batch_size = 8
  model.dataset.num_classes = 2
  trainer.devices = 1
  trainer.max_epochs = 1
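(Equivalently, roughly the same overrides can be applied in code with OmegaConf; the config file name and data paths below are placeholders, not the repo's exact ones.)

```python
# Hypothetical alternative to hand-editing the YAML: apply the overrides with OmegaConf.
from omegaconf import OmegaConf

cfg = OmegaConf.load("text_classification_config.yaml")  # placeholder file name
cfg.model.train_ds.file_path = "path/to/train"
cfg.model.validation_ds.file_path = "path/to/test"
cfg.model.train_ds.batch_size = 8
cfg.model.validation_ds.batch_size = 8
cfg.model.dataset.num_classes = 2
cfg.trainer.devices = 1
cfg.trainer.max_epochs = 1
```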

Then I ran two trainings with these commands:

  1. horovodrun -np 1 -H localhost:1 --verbose --start-timeout 300 python text_classification_with_bert.py
  2. horovodrun -np 2 -H localhost:2 --verbose --start-timeout 300 python text_classification_with_bert.py

For the second one, I also changed trainer.devices to 2 in the config file.

The first training took 15:06 min and the second took 22:00 min.

Expected behavior I expected the second training to take less time.

Environment overview
  horovod: 0.24.3
  pytorch-lightning: 1.6.3
  pytorch: 1.11.0
  nemo-toolkit: 1.8.2

EmreOzkose commented 2 years ago

Additional note:
elapsed time for 3 devices: 15:19
elapsed time for 4 devices: 14:13

VahidooX commented 2 years ago

Why are you using Horovod for multi-GPU training? PTL and PyTorch have native support for multi-GPU training. You may just set devices=2 to make it work on 2 GPUs.
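Roughly, that suggestion looks like this (a sketch only, keeping the example's original trainer line; the NLPDDPPlugin import path is the NeMo 1.x one and may differ in other versions):

```python
# Native multi-GPU path: keep NeMo's DDP plugin and raise trainer.devices;
# PTL launches one process per GPU itself, no horovodrun needed.
from omegaconf import OmegaConf
import pytorch_lightning as pl
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPPlugin  # NeMo 1.x path

# Stand-in for the Hydra config with the devices override applied.
cfg = OmegaConf.create({"trainer": {"accelerator": "gpu", "devices": 2, "max_epochs": 1}})

trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **cfg.trainer)
# Launched simply as: python text_classification_with_bert.py
```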

EmreOzkose commented 2 years ago

I am also using Horovod for the multi-machine case. I ran the same script with different parameters:

  1. horovodrun -np 1 -H server1:1 --verbose --start-timeout 300 python text_classification_with_bert.py
  2. horovodrun -np 2 -H server1:2 --verbose --start-timeout 300 python text_classification_with_bert.py
  3. horovodrun -np 3 -H server1:3 --verbose --start-timeout 300 python text_classification_with_bert.py

Elapsed times are :

  1. 20:42
  2. 23:05
  3. 18:06

My ultimate goal with Horovod is multi-machine training, but I also wanted to observe the multi-GPU case on the localhost machine first.

VahidooX commented 2 years ago

This part is handled by PTL, and AFAIK we have not tested our models with Horovod. I suggest increasing the batch size to the maximum that does not cause OOM so the GPUs are utilized as much as possible, then rerunning the tests. Also train for more than 1 epoch to get a better estimate; the first epoch can sometimes take longer. What time per step do you see in the progress bar in the output logs for the different runs?
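To make that comparison concrete, here is a back-of-the-envelope check with made-up numbers (not measurements from these runs): extra workers only help if samples per second actually grows.

```python
# Illustrative numbers only: more workers with slower steps can cancel each other out.
def samples_per_sec(batch_size, num_workers, sec_per_step):
    return batch_size * num_workers / sec_per_step

print(samples_per_sec(8, 1, 0.50))  # 1 worker:  16.0 samples/s
print(samples_per_sec(8, 2, 0.90))  # 2 workers: ~17.8 samples/s, barely faster
```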

EmreOzkose commented 2 years ago

Actually, I ran each experiment only once. I am now running experiments with more epochs and larger batch sizes for a better estimate. I will report back here, thank you.

okuchaiev commented 2 years ago

Wow, I didn't know Horovod would work. But regarding multi-machine: how are the machines connected? Could machine-to-machine communication be a bottleneck for you?

EmreOzkose commented 2 years ago

I think the network bandwidth has to be 10 Gbit or 25 Gbit; mine is 1 Gbit. I ran some DDP experiments in SpeechBrain; details can be found here. In summary, when bandwidth is low, training with DDP appears to fail, but the real cause is the low bandwidth. I ran only DDP experiments to check whether the problem is the network, but I think my Horovod experiments also suffer from the low bandwidth.
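As a rough sanity check (with an assumed model size, not a measurement): BERT-base has on the order of 110M parameters, and DDP/Horovod all-reduce roughly that many fp32 gradients every step, so on a 1 Gbit link the communication alone can dominate the step time.

```python
# Back-of-the-envelope estimate; assumes ~110M fp32 parameters (BERT-base-like)
# and ignores overlap, compression, and the exact all-reduce algorithm.
params = 110_000_000
grad_bytes = params * 4                 # fp32 gradients: ~440 MB per step
link_bytes_per_s = 1e9 / 8              # 1 Gbit/s link ~= 125 MB/s
print(grad_bytes / link_bytes_per_s)    # ~3.5 s of gradient traffic per step
```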