I am running HorovodRunner on Databricks Runtime 7.0 ML with 3 Standard_NC24 GPU worker instances, and it seems that not all of the available GPUs are being utilized. There are 4 GPUs on each worker, so 12 GPUs in total.
I have been running tests using the following code:
import horovod.torch as hvd
from sparkdl import HorovodRunner

def test_fn():
    hvd.init()
    print(hvd.local_rank())

hr = HorovodRunner(np=8)
hr.run(test_fn)
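For context, here is a rough sketch of where the 8 processes would land, assuming block (fill-by-slot) placement, which Open MPI commonly uses when a hostfile lists slot counts. This is an assumption about how HorovodRunner schedules processes, and the helper below is purely illustrative, not HorovodRunner's actual placement code:

```python
def expected_local_ranks(np, slots_per_node, num_nodes):
    """Simulate block (fill-by-slot) placement: each node is filled
    to its slot count before moving on to the next node.

    Returns a dict mapping global rank -> (node index, local rank).
    """
    placement = {}
    node, local = 0, 0
    for rank in range(np):
        placement[rank] = (node, local)
        local += 1
        if local == slots_per_node:
            node += 1
            local = 0
    return placement

# With np=8 across 3 workers of 4 GPU slots each, block placement puts
# ranks 0-3 on worker 0 and ranks 4-7 on worker 1, each with local
# ranks 0-3 -- so 8 of the 12 GPUs get one process each, and the third
# worker stays idle.
print(expected_local_ranks(8, 4, 3))
```

Under that assumption, seeing only 8 distinct local ranks across two of the three workers would be the expected outcome of np=8, not a scheduling bug.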
Why isn't HorovodRunner picking up all of the available GPUs, and why is it doubling or tripling up processes on a few of them? Am I doing something wrong here? Is this an issue with Horovod rather than HorovodRunner?
It turns out the GPU utilization was occurring exactly as it should; the issue we were having was with Comet.ml and its system metrics tracking. I'll go ahead and close this issue.
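One way to double-check GPU assignment independently of any metrics dashboard is to count compute processes per GPU straight from `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader` on each worker. The helper below parses that output; the sample string is illustrative, not captured from the cluster above:

```python
from collections import Counter

def processes_per_gpu(query_output):
    """Tally compute processes per GPU from the CSV output of
    nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader."""
    counts = Counter()
    for line in query_output.strip().splitlines():
        gpu_uuid, _pid = (field.strip() for field in line.split(","))
        counts[gpu_uuid] += 1
    return dict(counts)

# Illustrative sample: one training process per GPU means each UUID
# appears exactly once in the tally.
sample = """GPU-aaa, 1001
GPU-bbb, 1002
GPU-ccc, 1003
GPU-ddd, 1004"""
print(processes_per_gpu(sample))
```

If every GPU UUID shows a count of 1, the processes are pinned one-per-GPU as expected, regardless of what the monitoring tool reports.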
Environment:
Framework: PyTorch
Framework version: 1.5.0
Horovod version: 0.19.1
MPI version: mpirun (Open MPI) 3.0.0
CUDA version: 10.1
NCCL version: 2.7.3
Python version: 3.7.6
OS and version: Ubuntu 18.04.4 LTS
GCC version: 7.5.0
Question:
At one point, the output of this code was:
I then restarted the cluster and the output was:
Any help is greatly appreciated!