AndreasMadsen / my-setup

Guides for myself and others on how some of my stuff is configured.

Update of bazel and tensorflow #6

Closed. LasseRegin closed this issue 7 years ago

LasseRegin commented 7 years ago

Updated TensorFlow to v0.12.1 and Bazel to 0.4.2 (required for the newer TensorFlow).

NOTE: If the fork is merged, the following line of setup-python3.sh (L. 239)

curl -L https://raw.githubusercontent.com/LasseRegin/my-setup/master/dtu-hpc-python3/tensorflow.patch | git am -

should be changed to

curl -L https://raw.githubusercontent.com/AndreasMadsen/my-setup/master/dtu-hpc-python3/tensorflow.patch | git am -

LasseRegin commented 7 years ago

Haha, I tried to be the hero, but I ended up getting a CUDA_ERROR_INVALID_DEVICE error every time I tried to create a TensorFlow session (tf.Session()) with a GPU available. I will try to recreate it and post the full error message. I am trying to figure out why this is not working. Maybe adding OpenCL will help, I don't know. 😛
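For reference, a quick way to check which devices this TensorFlow build can actually see (just a sketch; device_lib is an internal module, so this assumes it is available in the 0.12 build):

from tensorflow.python.client import device_lib

# Lists the CPU and GPU devices TensorFlow managed to initialize; on the
# cluster this triggers the same CUDA initialization as tf.Session().
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)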

LasseRegin commented 7 years ago

Adding OpenCL didn't help. I still get this error:

>>> tf.Session()
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/zhome/77/3/77734/stdpy3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1186, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/zhome/77/3/77734/stdpy3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 551, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/appl/python/3.5.1/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/zhome/77/3/77734/stdpy3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

If I try creating a session again, it repeats the error but with the device ordinal incremented, i.e. StreamExecutor for CUDA device ordinal 1. This continues until it runs out of GPU devices to try, at which point it falls back to the CPU and no longer throws the error.
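To see where the ops actually land after the fallback, device placement can be logged when creating the session (a sketch using the standard ConfigProto options):

import tensorflow as tf

# allow_soft_placement lets TensorFlow fall back to the CPU when a GPU is
# unavailable; log_device_placement prints the device chosen for every op.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    print(sess.run(a * 2.0))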

AndreasMadsen commented 7 years ago

Did you specify CUDA_VISIBLE_DEVICES? See: https://github.com/AndreasMadsen/my-setup/tree/master/dtu-hpc-python3#known-issues
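For reference, the variable has to be set before TensorFlow initializes CUDA; a minimal sketch (the GPU index 0 is only an example, pick one from nvidia-smi):

import os

# Must be set before importing tensorflow, since the value is read
# when CUDA is initialized. "0" is the GPU index reported by nvidia-smi.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
sess = tf.Session()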

LasseRegin commented 7 years ago

Yes. Same error unfortunately.

AndreasMadsen commented 7 years ago

> Yes. Same error unfortunately.

Hmm, I will look into it.

AndreasMadsen commented 7 years ago

Using nvidia-smi I can see that some MatLab user is using 3 of the 4 GPUs on the K40 machine. When setting CUDA_VISIBLE_DEVICES it has to be a GPU that is not in use. In my case I first tried CUDA_VISIBLE_DEVICES=3, which failed with the error you specified; however, CUDA_VISIBLE_DEVICES=0 worked.

PS: I ran your script and everything appears to work. When you remove the NOTE I will merge it.


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 0000:02:00.0     Off |                    0 |
| 30%   66C    P0   119W / 235W |   1651MiB / 11439MiB |     68%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          On   | 0000:03:00.0     Off |                    0 |
| 35%   46C    P0    49W / 225W |     65MiB /  4742MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          On   | 0000:83:00.0     Off |                    0 |
| 40%   52C    P0   106W / 225W |   1530MiB /  4742MiB |     77%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:84:00.0     Off |                  N/A |
| 22%   36C    P8    30W / 250W |      2MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     13055    C   /appl/matlab/850/bin/glnxa64/MATLAB           1649MiB |
|    1      3808    C   /appl/matlab/900/bin/glnxa64/MATLAB             63MiB |
|    2     12824    C   /appl/matlab/850/bin/glnxa64/MATLAB           1528MiB |
+-----------------------------------------------------------------------------+

I think that CUDA_VISIBLE_DEVICES should be (GPU_FAN + 1) mod 4, so in this case CUDA_VISIBLE_DEVICES=0 is the way to go.
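A rough sketch (not part of the setup scripts) that picks the GPU with the least memory in use from the same nvidia-smi information and exposes only that one:

import os
import subprocess

# Query GPU index and used memory from nvidia-smi; assumes nvidia-smi is on PATH.
query = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]).decode()

# Each line looks like "0, 1651"; pick the index with the least memory in use.
gpus = [line.split(", ") for line in query.strip().split("\n")]
free_gpu = min(gpus, key=lambda gpu: int(gpu[1]))[0]

os.environ["CUDA_VISIBLE_DEVICES"] = free_gpu
print("Using GPU", free_gpu)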

AndreasMadsen commented 7 years ago

Regarding the CUDA_VISIBLE_DEVICES issue, you should follow this thread: https://github.com/tensorflow/tensorflow/issues/152

AndreasMadsen commented 7 years ago

I removed the note and changed the patch URL in the merge. Landed in 1e584cf5819bfaed9f3280d664e510f268efe86b.

Awesome work 💯 👍

LasseRegin commented 7 years ago

Damn those MatLab users! 👿

But awesome and thanks! 👍 🎉