AIM-Harvard / DeepCAC

Fully automatic coronary calcium risk assessment using Deep Learning.
GNU General Public License v3.0

Nvidia driver/cuda requirement? #3

Closed jpcenteno80 closed 3 years ago

jpcenteno80 commented 3 years ago

Sorry to bug you again, but on a 4-GPU instance I ran into this error when executing python run_step1_heart_localization.py:

Deep Learning model inference using 4xGPUs:
Loading saved model from "../data/step1_heartloc/model_weights/step1_heartloc_model_weights.hdf5"
Compiling multi GPU model...
Traceback (most recent call last):
  File "run_step1_heart_localization.py", line 153, in <module>
    weights_file_name = weights_file_name)
  File "/home/jpcenteno/development/DeepCAC/src/step1_heartloc/run_inference.py", line 147, in run_inference
    ext = extended)
  File "/home/jpcenteno/development/DeepCAC/src/step1_heartloc/heartloc_model.py", line 54, in get_unet_3d
    initial_learning_rate=initial_learning_rate, mgpu=mgpu)
  File "/home/jpcenteno/development/DeepCAC/src/step1_heartloc/heartloc_model.py", line 113, in get_unet_3d_4
    parallel_model = multi_gpu_model(model, gpus=mgpu)
  File "/home/jpcenteno/venv/lib/python2.7/site-packages/tensorflow/python/keras/utils/multi_gpu_utils.py", line 182, in multi_gpu_model
    available_devices))
ValueError: To call `multi_gpu_model` with `gpus=4`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/cpu:0', '/xla_gpu:0', '/xla_gpu:1', '/xla_gpu:2', '/xla_gpu:3', '/xla_cpu:0']. Try reducing `gpus`.

So I had to change line 171 of venv/lib/python2.7/site-packages/tensorflow/python/keras/utils/multi_gpu_utils.py.

From:
    target_devices = ['/cpu:0'] + ['/gpu:%d' % i for i in target_gpu_ids]
To:
    target_devices = ['/cpu:0'] + ['/xla_gpu:%d' % i for i in target_gpu_ids]

This is what I am running on Nvidia: [image]

That fixed the issue.
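To make the mismatch concrete, here is a minimal pure-Python sketch of the kind of device-name check that raises the ValueError above. It is a simplified illustration, not the Keras source: the helper names normalize_device_name and missing_devices are hypothetical, and the device list is hard-coded to mirror the one in the error message rather than queried from TensorFlow.

```python
def normalize_device_name(name):
    """Shorten a full device name, e.g.
    '/job:localhost/replica:0/task:0/device:XLA_GPU:0' -> '/xla_gpu:0'."""
    return '/' + name.lower().split('device:')[1]


def missing_devices(available, gpus, prefix='gpu'):
    """Return the expected target devices that are not in `available`.

    `prefix` is an illustrative parameter: 'gpu' models the default
    expectation, 'xla_gpu' models the patched line described above.
    """
    normalized = [normalize_device_name(n) for n in available]
    targets = ['/cpu:0'] + ['/%s:%d' % (prefix, i) for i in range(gpus)]
    return [d for d in targets if d not in normalized]


# Assumed device list, reconstructed from the error message in this issue.
base = '/job:localhost/replica:0/task:0/device:'
available = (
    [base + 'CPU:0']
    + [base + 'XLA_GPU:%d' % i for i in range(4)]
    + [base + 'XLA_CPU:0']
)

# With the default '/gpu:N' names, every GPU target is reported missing.
print(missing_devices(available, gpus=4))

# With '/xla_gpu:N' names (the edit described above), nothing is missing.
print(missing_devices(available, gpus=4, prefix='xla_gpu'))
```

The first call lists all four '/gpu:N' names as missing, which corresponds to the error; the second returns an empty list, which is why changing the prefix in line 171 lets the check pass.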

9zelle9 commented 3 years ago

We tested the code on Ubuntu 18.04 with CUDA 10.1 and libcudnn 7.6.

As for the error message, it originates outside our code. We did not make any changes to the Keras libraries.

mickalus1 commented 3 years ago

I had the same issue and resolved it by modifying some Python scripts; see the answer from Michele Bianco at https://stackoverflow.com/questions/52950449/valueerror-when-using-multi-gpu-model-in-keras. Runs fine for me :)