Closed LasseRegin closed 7 years ago
Haha, I tried to be the hero, but I ended up getting a CUDA_ERROR_INVALID_DEVICE error every time I tried to create a tensorflow session with tf.Session() while a GPU was available. I will try to recreate it and post the full error message.
I am trying to figure out why this is not working. Maybe adding OpenCL will help, I don't know. 😛
Adding OpenCL didn't help. I still get this error:
>>> tf.Session()
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/zhome/77/3/77734/stdpy3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1186, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/zhome/77/3/77734/stdpy3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 551, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/appl/python/3.5.1/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/zhome/77/3/77734/stdpy3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
If I try creating a session again, it repeats the error but with the device ordinal incremented (StreamExecutor for CUDA device ordinal 1). This happens until it runs out of GPU devices to try, after which it falls back to the CPU and no longer throws the error.
Did you specify CUDA_VISIBLE_DEVICES? See:
https://github.com/AndreasMadsen/my-setup/tree/master/dtu-hpc-python3#known-issues
Yes. Same error unfortunately.
Hmm, I will look into it.
Using nvidia-smi I can see that some MatLab user is using 3 of the 4 GPUs on the K40 machine. When setting CUDA_VISIBLE_DEVICES, it has to be a GPU that is not in use. In my case I first tried CUDA_VISIBLE_DEVICES=3, which failed with the error you specified; however, CUDA_VISIBLE_DEVICES=0 worked.
PS: I ran your script, everything appears to work. When you remove the NOTE I will merge it.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 0000:02:00.0     Off |                    0 |
| 30%   66C    P0   119W / 235W |   1651MiB / 11439MiB |     68%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          On   | 0000:03:00.0     Off |                    0 |
| 35%   46C    P0    49W / 225W |     65MiB /  4742MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          On   | 0000:83:00.0     Off |                    0 |
| 40%   52C    P0   106W / 225W |   1530MiB /  4742MiB |     77%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:84:00.0     Off |                  N/A |
| 22%   36C    P8    30W / 250W |      2MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     13055    C   /appl/matlab/850/bin/glnxa64/MATLAB           1649MiB |
|    1      3808    C   /appl/matlab/900/bin/glnxa64/MATLAB             63MiB |
|    2     12824    C   /appl/matlab/850/bin/glnxa64/MATLAB           1528MiB |
+-----------------------------------------------------------------------------+
I think that CUDA_VISIBLE_DEVICES=(GPU_FAN + 1) mod 4, so in this case CUDA_VISIBLE_DEVICES=0 is the way to go.
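For anyone who wants to script the selection, here is a rough Python sketch of the steps above: read the Processes table from nvidia-smi output, find a GPU index with no process on it, and apply the guessed (index + 1) mod 4 mapping. The function names (busy_gpus, guessed_cuda_ordinal, pick_visible_device) and the SAMPLE text are made up for illustration, and the mapping is only the guess from this thread for this one machine, not a documented rule.

```python
import os
import re

# Process rows copied from the nvidia-smi output in this thread (illustrative input).
SAMPLE = """\
| GPU PID Type Process name Usage |
| 0 13055 C /appl/matlab/850/bin/glnxa64/MATLAB 1649MiB |
| 1 3808 C /appl/matlab/900/bin/glnxa64/MATLAB 63MiB |
| 2 12824 C /appl/matlab/850/bin/glnxa64/MATLAB 1528MiB |
"""

# Matches one process row: GPU index, PID, type (C, G or C+G), name, memory.
PROCESS_ROW = re.compile(r"^\|\s*(\d+)\s+\d+\s+[CG+]+\s+.+?\s+\d+MiB\s*\|$")


def busy_gpus(smi_text):
    """Return the set of nvidia-smi GPU indices that have a process listed."""
    busy = set()
    for line in smi_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            busy.add(int(match.group(1)))
    return busy


def guessed_cuda_ordinal(smi_index, gpu_count):
    """Apply the (index + 1) mod gpu_count guess from this thread (an assumption)."""
    return (smi_index + 1) % gpu_count


def pick_visible_device(smi_text, gpu_count):
    """Return a guessed CUDA ordinal for the first idle GPU, or None if all are busy."""
    free = sorted(set(range(gpu_count)) - busy_gpus(smi_text))
    if not free:
        return None
    return guessed_cuda_ordinal(free[0], gpu_count)


device = pick_visible_device(SAMPLE, gpu_count=4)
if device is not None:
    # Must be set before tensorflow is imported, otherwise it has no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
```

With the table above, GPUs 0, 1 and 2 are busy, so the only idle index is 3 and the guess maps it to 0. An alternative that may avoid the guesswork is exporting CUDA_DEVICE_ORDER=PCI_BUS_ID, which asks CUDA to enumerate devices in PCI bus order, the same order nvidia-smi lists them in; whether that resolves the mismatch on this machine has not been tested here.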
Regarding the CUDA_VISIBLE_DEVICES issue, you should follow this thread: https://github.com/tensorflow/tensorflow/issues/152
I removed the note and changed the patch url in merge. Landed in 1e584cf5819bfaed9f3280d664e510f268efe86b
Awesome work 💯 👍
Damn those MatLab users! 👿
But awesome and thanks! 👍 🎉
Updated tensorflow to v. 0.12.1 and bazel to 0.4.2 (required for newer tensorflow).
NOTE: If the fork is merged, the following line of setup-python3.sh (L. 239) should be changed to