databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

HorovodRunner not recognizing multiple GPUs on Databricks #230

Closed mbluestone closed 4 years ago

mbluestone commented 4 years ago

Environment:

Framework: PyTorch
Framework version: 1.5.0
Horovod version: 0.19.1
MPI version: mpirun (Open MPI) 3.0.0
CUDA version: 10.1
NCCL version: 2.7.3
Python version: 3.7.6
OS and version: Ubuntu 18.04.4 LTS
GCC version: 7.5.0

Question:

I am running HorovodRunner on Databricks Runtime 7.0 ML with 3 Standard_NC24 GPU worker instances, and it appears that not all available GPUs are being utilized. Each worker has 4 GPUs, so there are 12 GPUs in total.

I have been running tests using the following code:

import horovod.torch as hvd
from sparkdl import HorovodRunner

def test_fn():
    hvd.init()
    print(hvd.local_rank())

hr = HorovodRunner(np=8)
hr.run(test_fn)

At one point, the output of this code was:

[1,3]<stdout>:1
[1,0]<stdout>:0
[1,1]<stdout>:0
[1,5]<stdout>:2
[1,7]<stdout>:3
[1,4]<stdout>:2
[1,2]<stdout>:1
[1,6]<stdout>:3

I then restarted the cluster and the output was:

[1,6]<stdout>:2
[1,0]<stdout>:0
[1,3]<stdout>:1
[1,7]<stdout>:2
[1,4]<stdout>:1
[1,1]<stdout>:0
[1,2]<stdout>:0
[1,5]<stdout>:1

Why isn't HorovodRunner picking up all of the available GPUs, and why is it doubling or tripling up processes on a few of them? Am I doing something wrong here? Is this an issue with Horovod rather than HorovodRunner?

Any help is greatly appreciated!

mbluestone commented 4 years ago

Turns out the GPU utilization is occurring exactly as it should; the issue we were having was with Comet.ml and its system metrics tracking. I'll go ahead and close this issue.
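For readers hitting the same confusion: `hvd.local_rank()` returns a process's rank within its own worker node, not across the cluster, so the same local-rank values appearing for multiple global ranks is expected whenever `np` spans several workers. A minimal pure-Python sketch of the idea (the helper name is hypothetical, and it assumes ranks are assigned to workers in contiguous blocks; the actual placement depends on the MPI launcher):

```python
def local_ranks(np_total, procs_per_worker):
    """Hypothetical illustration: map each global Horovod rank to the
    local rank it would receive if ranks were assigned to workers in
    contiguous blocks of size procs_per_worker."""
    return {rank: rank % procs_per_worker for rank in range(np_total)}

# np=8 across workers with 4 GPUs each: local ranks 0-3 each appear
# twice, once per worker, so no GPU is shared within a single worker.
mapping = local_ranks(8, 4)
print(sorted(mapping.values()))  # -> [0, 0, 1, 1, 2, 2, 3, 3]
```

This matches the first output above, where each local rank 0-3 is printed by two different global ranks, i.e. one process per GPU on each worker.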