NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Multinode implementation of Tensorflow with Horovod #1332

Open sowmya04101998 opened 11 months ago


Related to Model/Framework(s)
The training script is from https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/RN50v1.5/main.py

Describe the bug
I am trying to run the code on multiple nodes with multiple GPUs. The steps in the documentation are for Docker, but I am using enroot. I was able to run the code on a single GPU and on multiple GPUs within a single node, but I am not able to run it across multiple nodes.

To Reproduce Steps to reproduce the behavior:

GPU node 1
cd /scratch/cdacapp/enroot/
enroot create --name tensorflow1 nvidia+tensorflow+21.06-tf1-py3.sqsh
enroot start --mount /scratch/cdacapp/ --rw tensorflow1
cd /scratch/cdacapp/tensorflow/DeepLearningExamples/TensorFlow/Classification/ConvNets
horovodrun -np 4 -H ip:4,ip:2 -p 6655 python3 main.py --mode=training_benchmark --amp --batch_size 128 --data_dir=/scratch/cdacapp/tfrecords/tf_records --results_dir=/scratch/cdacapp/tensorflow/results

GPU node 2
cd /scratch/cdacapp/enroot/
enroot create --name tensorflow1 nvidia+tensorflow+21.06-tf1-py3.sqsh
enroot start --mount /scratch/cdacapp/ --rw tensorflow1
cd /scratch/cdacapp/tensorflow/DeepLearningExamples/TensorFlow/Classification/ConvNets
horovodrun -np 4 -H ip:4,ip:2 -p 6655 python3 main.py --mode=training_benchmark --amp --batch_size 128 --data_dir=/scratch/cdacapp/tfrecords/tf_records --results_dir=/scratch/cdacapp/tensorflow/results
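As a sanity check on the `-np` and `-H` values in the commands above, here is a minimal sketch that sums the slots in a `host:slots` list, the format `horovodrun -H` takes. It is not part of the repository: `parse_hosts` and `total_slots` are hypothetical helpers, and the host names are placeholders because the real addresses are elided as `ip` in the report.

```python
def parse_hosts(hosts):
    """Parse a horovodrun-style host list such as 'node1:4,node2:2'.

    Returns a list of (host, slots) tuples. Hypothetical helper for
    illustration only; IPv6 addresses are not handled.
    """
    result = []
    for entry in hosts.split(","):
        # Split on the last colon so the slot count is the final field.
        name, _, slots = entry.strip().rpartition(":")
        if not name or not slots.isdigit():
            raise ValueError(f"expected host:slots, got {entry!r}")
        result.append((name, int(slots)))
    return result


def total_slots(hosts):
    """Total number of process slots across all listed hosts."""
    return sum(slots for _, slots in parse_hosts(hosts))


if __name__ == "__main__":
    hosts = "node1:4,node2:2"   # placeholder names; slot counts from the report
    np_requested = 4            # value passed to -np in the report
    print(f"slots available: {total_slots(hosts)}, "
          f"processes requested: {np_requested}")
```

With the slot counts from the report, the host list offers 6 slots while `-np` requests 4 processes, so it may be worth checking that `-np` matches the number of GPUs intended across both nodes.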

I am facing the error below:

[screenshot: err]

Does this code implement multi-node, multi-GPU training? Please guide me on this, as I'm using enroot. I've also tried running main.py from a Slurm script, but that offloads the task to only 1 GPU. What am I doing wrong?
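One way to narrow down the single-GPU behaviour under Slurm is to check how many tasks the job step actually launched. This is a minimal sketch, assuming the script is started through `srun` so that Slurm's standard `SLURM_*` environment variables are set. `slurm_task_summary` is a hypothetical helper, not part of main.py, and the defaults apply whenever a variable is missing.

```python
import os


def slurm_task_summary(env=None):
    """Summarize the Slurm layout visible to this process.

    Reads standard Slurm variables from `env` (defaults to os.environ).
    Missing or non-numeric values fall back to single-task defaults.
    """
    env = os.environ if env is None else env

    def as_int(key, default):
        value = env.get(key)
        return int(value) if value is not None and value.isdigit() else default

    return {
        "nodes": as_int("SLURM_JOB_NUM_NODES", 1),   # nodes allocated to the job
        "tasks": as_int("SLURM_NTASKS", 1),          # total tasks in the job
        "global_rank": as_int("SLURM_PROCID", 0),    # this task's global index
        "local_rank": as_int("SLURM_LOCALID", 0),    # this task's index on its node
    }


if __name__ == "__main__":
    print(slurm_task_summary())
```

If `tasks` comes back as 1, Slurm started a single process, which could explain why only one GPU is used. In that case the batch script's task and GPU request options are the first place to look.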