Open kba opened 4 years ago
I cannot test this - @bertsky, it basically worked for you with nvidia-docker / CUDA 11 (i.e. core itself), but you haven't tested with ocrd_all, I suppose?
Not yet, sorry. And it's possible the TF I ran was self-built, I am not sure. I'll report back when my CUDA device and I have found some spare time.
It's not really systematic (yet), I just had problems and had a look at combinations that work reproducibly. However, I can produce a complete matrix of working and non-working combinations. I'll check it out this week.
I did not/will not test self-built TensorFlow versions, because that will be an even greater pain in the butt :-)
I've added CUDA 11.0 in a new script (https://cvs.moegen-wir.net/mikegerber/test-nvidia/src/branch/master/run-docker-compatibility-matrix); no success with that one either.
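For context, the "GPU available" lines in the log below presumably boil down to a probe like this inside each container (a minimal sketch under my own assumptions, not the actual script contents; image tag and TF package are just examples):

```sh
# Sketch only: probe one base image / TF wheel combination for GPU visibility.
# Needs the NVIDIA container runtime (nvidia-docker / --gpus all).
IMAGE=nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
TF_PKG="tensorflow==2.3.*"        # or tensorflow-gpu==1.15.*
docker run --rm --gpus all "$IMAGE" bash -c "
  nvidia-smi -L
  apt-get update -qq && apt-get install -qq -y python3-pip >/dev/null
  pip3 install -q --upgrade pip && python3 -m pip install -q '$TF_PKG'
  python3 -c 'import tensorflow as tf; print(\"TensorFlow\", tf.__version__); print(\"GPU available:\", tf.test.is_gpu_available())'
"
```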
Tested: combinations of
- tensorflow-gpu == 1.15.*
- tensorflow == 2.* (= TF 2.3 currently)

Only combinations that work with GPU support:
- tensorflow-gpu == 1.15.* with CUDA 10.0
- tensorflow == 2.* with CUDA 10.1

Full log:
== tf1 nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: True
== tf1 nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf1 nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf1 nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf2 nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: False
== tf2 nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: True
== tf2 nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: False
== tf2 nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: False
So this is what we get using PyPI/pip. Self-compiled or Anaconda might be totally different.
As I said, we should also search for official documentation/confirmation of this. It seems utterly stupid that TF prebuilds would be provided so narrowly (if not surprising, given their Python versioning and backwards compatibility policy).
Another hint that this is definitely by design: https://github.com/tensorflow/tensorflow/issues/26289#issuecomment-491768524
Could we perhaps build upon the prebuilt tensorflow-gpu Docker images instead?
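If that route were explored, a quick sanity check of the upstream image could look like this (a sketch; tensorflow/tensorflow:2.3.0-gpu is the upstream GPU tag, the check itself is my own):

```sh
# Sketch: check whether the prebuilt upstream GPU image sees the GPU at all.
docker run --rm --gpus all tensorflow/tensorflow:2.3.0-gpu \
  python -c 'import tensorflow as tf; print("GPU available:", bool(tf.config.list_physical_devices("GPU")))'
```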
> As I said, we should also search for official documentation/confirmation of this. It seems utterly stupid that TF prebuilds would be provided so narrowly (if not surprising, given their Python versioning and backwards compatibility policy).
Official documentation says CUDA Toolkit 10.1 for the latest pip-installable tensorflow, which currently is 2.3.1. It also tells us about the older version with GPU support (pip install tensorflow-gpu==1.15) but does not mention that that build does not work with CUDA Toolkit 10.1 (I believe this is a bug in the documentation).
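To spell out what that means for the two pip-installable flavours (the CUDA/cuDNN annotations reflect the tests above and the TF install docs, not guarantees):

```sh
pip install "tensorflow-gpu==1.15.*"   # expects CUDA Toolkit 10.0 + cuDNN 7
pip install "tensorflow==2.3.*"        # expects CUDA Toolkit 10.1 + cuDNN 7
```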
Given the history of the TF documentation with regard to GPU support (remember how they changed the package names back and forth?), I do not place much trust in it at all and prefer to test working build/version combinations on every update. (A cleaner approach would certainly be to build it on our own, if possible, just like the conda packages, but I seriously believe it's too much work for not much gain. Who knows whether TF 2.4 will still work with any CUDA Toolkit that TF 1.15 supports?)
Another thing: the core CUDA image is based on an nvidia image without cuDNN support, in addition to having the wrong CUDA Toolkit version for both "flavours" of TensorFlow (11.0 does not work with either).
IMHO, what needs to be done: base the GPU builds on two images,
- nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04 (CUDA 10.0, for tensorflow-gpu 1.15) and
- nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 (CUDA 10.1, for tensorflow 2.x).

I know this is a pain in the butt, but I don't see any way around it (other than trying to install both CUDA versions plus compatible cuDNN versions into the same image and then doing LD_LIBRARY_PATH tricks); see the sketch below.
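A rough sketch of what the two-image approach could look like, assuming a hypothetical BASE_IMAGE build argument (the argument name and target tags are my placeholders, not existing ocrd_all parameters):

```sh
# Hypothetical: build one GPU variant per TF flavour from its matching CUDA base.
docker build --build-arg BASE_IMAGE=nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04 \
  -t ocrd_all:gpu-cuda10.0-tf1 .
docker build --build-arg BASE_IMAGE=nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 \
  -t ocrd_all:gpu-cuda10.1-tf2 .
```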
(As always, when I say TF version I mean "the TF wheel with that version on PyPI")
Update (ht @mikegerber): at least TF 2.4+ now requires CUDA 11.
The current Docker image for CUDA is no longer usable with ocrd_all
and cannot be rebuilt because NVIDIA .deb links seem to be broken.
@mikegerber has been systematically testing CUDA and tensorflow versions. Unfortunately, the compatibility table apparently goes like this: