Running DeepReg on the Cluster with GPU

Zhiyuan-w commented 3 years ago

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

If the bug is confirmed, would you be willing to submit a PR? (Help can be provided if you need assistance submitting a PR)

Yes

Your environment

3e372d1835fdc9468c026db3767dcf9e8d4a4b0e (commit hash)

Steps to reproduce

"deepreg_train --gpu '0' "
f"--config_path demos/{name}/{name}.yaml "
f"--log_dir demos/{name} "
"--exp_name logs_train\n"

Run the code in a cluster environment.

Expected behaviour

The GPU works.

Actual behaviour

The GPU may not work.

We cannot know beforehand which GPU the cluster will assign to the job.
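One way to see what the job was actually given is to inspect the device variable before launching training. The sketch below assumes the scheduler exposes its allocation through `CUDA_VISIBLE_DEVICES`; that is an assumption about the cluster, not something confirmed in this issue. `describe_gpu_allocation` is a hypothetical helper and not part of DeepReg.

```python
import os


def describe_gpu_allocation(environ=None):
    """Describe which GPU ids CUDA_VISIBLE_DEVICES exposes to this process.

    Hypothetical helper for debugging; `environ` defaults to os.environ
    and can be replaced with a plain dict for testing.
    """
    environ = os.environ if environ is None else environ
    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # Nothing set: this variable places no restriction on the devices.
        return "CUDA_VISIBLE_DEVICES is not set"
    if visible.strip() == "":
        # An empty value hides every GPU from CUDA.
        return "CUDA_VISIBLE_DEVICES is empty; no GPU is visible"
    ids = [item.strip() for item in visible.split(",")]
    return f"CUDA_VISIBLE_DEVICES exposes {len(ids)} device(s): {', '.join(ids)}"


if __name__ == "__main__":
    print(describe_gpu_allocation())
```

If the variable is already set inside the job, passing `--gpu '0'` would replace it, which may point the process at a GPU that belongs to someone else.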

alkististav commented 3 years ago

I'm having the exact same issue

Zhiyuan-w commented 3 years ago

> I'm having the exact same issue

You can temporarily remove these two lines in /deepreg/train.py:

    # set env variables
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true" if gpu_allow_growth else "false"

It works for me.
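A less invasive alternative to deleting the lines might be to keep them but leave `CUDA_VISIBLE_DEVICES` untouched when the environment already defines it. This is only a sketch: `set_gpu_env` and its `respect_existing` parameter are hypothetical and not part of DeepReg, and it assumes the cluster scheduler sets `CUDA_VISIBLE_DEVICES` for the job.

```python
import os


def set_gpu_env(gpu, gpu_allow_growth, respect_existing=True, environ=None):
    """Set the GPU-related variables without overriding a scheduler allocation.

    Hypothetical helper (not DeepReg's API). Returns the value of
    CUDA_VISIBLE_DEVICES that ends up in effect.
    """
    environ = os.environ if environ is None else environ
    already_set = "CUDA_VISIBLE_DEVICES" in environ
    if not (respect_existing and already_set):
        # No allocation from the environment (or the caller insists): use --gpu.
        environ["CUDA_VISIBLE_DEVICES"] = gpu
    environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true" if gpu_allow_growth else "false"
    return environ["CUDA_VISIBLE_DEVICES"]
```

As with the original lines, this would have to run before TensorFlow initializes the GPU, since the variables are read at that point.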
