CUDA version woes

kba commented 4 years ago

@mikegerber has been systematically testing CUDA and tensorflow versions. Unfortunately, the compatibility table apparently goes like this:

tensorflow	CUDA
>= 1.15, < 2.1	10.0
2.1, 2.2, 2.3	10.1
2.3	11 (not sure about that either)

I cannot test this myself. @bertsky, it basically worked for you with nvidia-docker / CUDA 11 (i.e. core itself), but you haven't tested with ocrd_all, I suppose?

bertsky commented 4 years ago

> I cannot test this myself. @bertsky, it basically worked for you with nvidia-docker / CUDA 11 (i.e. core itself), but you haven't tested with ocrd_all, I suppose?

Not yet, sorry. And it's possible the TF I ran was self-built, I am not sure. I'll report back when my CUDA device and I have found some spare time.

mikegerber commented 4 years ago

It's not really systematic (yet). I just had problems and had a look at combinations that work reproducibly. However, I can produce a complete matrix of working and non-working combinations. I'll check it out this week.

I did not/will not test self-built TensorFlow versions, because that will be an even greater pain in the butt :-)

mikegerber commented 4 years ago

I've added CUDA 11.0 in a new script https://cvs.moegen-wir.net/mikegerber/test-nvidia/src/branch/master/run-docker-compatibility-matrix, no success with that one.

Tested: Combinations of

pip-installable tensorflow-gpu == 1.15.* and tensorflow == 2.* (= TF 2.3 currently)
CUDA Docker images 10.0, 10.1, 10.2 with CUDNN 7, 11.0 with CUDNN 8

Only combinations that work with GPU support:

tensorflow-gpu == 1.15.* with CUDA 10.0
tensorflow == 2.* with CUDA 10.1
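For readers who want to repeat a check like this, here is a minimal sketch of a probe that prints the same two lines as the log entries in this thread. It is not the script linked above; `probe_gpu` and `report` are hypothetical names, and the sketch assumes `tf.test.is_gpu_available()`, which exists in both TF 1.15 and TF 2.x (deprecated in 2.x).

```python
import importlib


def probe_gpu():
    """Return (tf_version, gpu_available), or (None, False) without TensorFlow."""
    try:
        # Imported lazily so the probe also runs where TensorFlow is missing.
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return None, False
    # Available in TF 1.15 and TF 2.x; TF 2.x prefers
    # tf.config.list_physical_devices("GPU") but still ships this call.
    return tf.__version__, bool(tf.test.is_gpu_available())


def report(tf_version, gpu_available):
    """Format the result in the style of the log lines in this thread."""
    if tf_version is None:
        return "TensorFlow not installed"
    return "TensorFlow {}\nGPU available: {}".format(tf_version, gpu_available)
```

Running `print(report(*probe_gpu()))` inside each container, after installing the wheel under test, would give one log entry per image.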

Full log:

== tf1 nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: True
== tf1 nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf1 nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf1 nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf2 nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: False
== tf2 nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: True
== tf2 nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: False
== tf2 nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: False

So this is what we get using PyPI/pip. Self-compiled or Anaconda might be totally different.
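To turn a log in this format into a matrix automatically, a small parser is enough. The following is a sketch under the assumption that every entry starts with a `== <flavour> <image>` header followed by `TensorFlow <version>` and `GPU available: <bool>` lines, as in the log above; `parse_log` and `working` are hypothetical helper names, and `SAMPLE_LOG` is an excerpt of that log.

```python
import re

# Excerpt of the log above (three of the eight entries).
SAMPLE_LOG = """\
== tf1 nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: True
== tf1 nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 1.15.3
GPU available: False
== tf2 nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
GPU 0: GeForce RTX 2080 (UUID: GPU-612ce75c-1340-772b-039c-2a83a3ea5c95)
TensorFlow 2.3.0
GPU available: True
"""


def parse_log(text):
    """Return one dict per '== flavour image' entry in the log."""
    results = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"^== (\S+) (\S+)$", line)
        if header:
            current = {"flavour": header.group(1), "image": header.group(2),
                       "tf_version": None, "gpu": None}
            results.append(current)
        elif current is not None and line.startswith("TensorFlow "):
            current["tf_version"] = line.split(" ", 1)[1]
        elif current is not None and line.startswith("GPU available: "):
            current["gpu"] = line.endswith("True")
        # Any other line (e.g. the "GPU 0: ..." device line) is ignored.
    return results


def working(results):
    """Return (flavour, image) pairs for which the GPU was usable."""
    return [(r["flavour"], r["image"]) for r in results if r["gpu"]]
```

On the excerpt, `working(parse_log(SAMPLE_LOG))` keeps only the tf1/10.0 and tf2/10.1 entries, matching the summary above.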

bertsky commented 4 years ago

As I said, we should also search for official documentation/confirmation of this. It seems utterly stupid that TF prebuilds would be provided so narrowly (if not surprising, given their Python versioning and backwards compatibility policy).

Another hint that this is definitely by design: https://github.com/tensorflow/tensorflow/issues/26289#issuecomment-491768524

Could we perhaps build upon the prebuilt tensorflow-gpu Docker images instead?

mikegerber commented 4 years ago

> As I said, we should also search for official documentation/confirmation of this. It seems utterly stupid that TF prebuilds would be provided so narrowly (if not surprising, given their Python versioning and backwards compatibility policy).

Official documentation says CUDA Toolkit 10.1 for the latest pip-installable tensorflow which currently is 2.3.1. It also tells us about the older version with GPU support (pip install tensorflow-gpu==1.15) but does not mention that that build does not work with CUDA Toolkit 10.1 (I believe this is a bug in the documentation).

Given the history of the TF documentation with regard to GPU support (remember how they changed the package names back and forth?), I do not trust it much at all and prefer to test working build version combinations on every update. (A cleaner approach would certainly be to build it on our own, if possible, just like the conda packages, but I seriously believe it's too much work for not much gain. Who knows whether TF 2.4 will still work with any CUDA Toolkit that TF 1.15 supports?)

mikegerber commented 4 years ago

Another thing: the core CUDA image is based on an nvidia image without CUDNN support in addition to having the wrong CUDA Toolkit version for both "flavours" of TensorFlow (11.0 does not work).

IMHO, what needs to be done:

Run TF 1.15 based processors on an image based on nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04.
Run TF 2.3 based processors on an image based on nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04.
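The two rules above amount to a lookup from the TF wheel's major.minor version to a base image. A sketch of that lookup, assuming only the two combinations confirmed in this thread (`BASE_IMAGES` and `base_image_for` are hypothetical names, not part of core or ocrd_all):

```python
# Only the combinations confirmed to work in this thread are listed;
# anything else is treated as untested rather than guessed.
BASE_IMAGES = {
    (1, 15): "nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04",
    (2, 3): "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04",
}


def base_image_for(tf_version):
    """Map a PyPI TensorFlow version string to a tested CUDA base image."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    try:
        return BASE_IMAGES[(major, minor)]
    except KeyError:
        raise ValueError(
            "no tested base image for TensorFlow {}".format(tf_version)
        ) from None
```

Failing loudly for unlisted versions is deliberate: a new TF release should trigger a new test run instead of silently inheriting an old image.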

I know this is a pain in the butt, but I don't see any way around it (other than trying to install both CUDA versions plus compatible CUDNN versions into the same image and then doing LD_LIBRARY_PATH tricks).

(As always, when I say TF version I mean "the TF wheel with that version on PyPI")
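For completeness, the LD_LIBRARY_PATH alternative mentioned above could be sketched as follows. This is untested and rests on assumptions: the `/usr/local/cuda-10.0` and `/usr/local/cuda-10.1` prefixes are the conventional toolkit locations but may differ in a combined image, and `CUDA_LIB_DIRS` and `env_for` are hypothetical names.

```python
import os

# Assumed install prefixes (conventional /usr/local/cuda-X.Y layout);
# adjust to wherever the two toolkits actually live in the image.
CUDA_LIB_DIRS = {
    "tf1": "/usr/local/cuda-10.0/lib64",
    "tf2": "/usr/local/cuda-10.1/lib64",
}


def env_for(flavour, base_env=None):
    """Return a copy of the environment with the matching CUDA libs first."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = CUDA_LIB_DIRS[flavour]
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the chosen toolkit wins over any other CUDA on the path.
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env
```

A processor would then be started with something like `subprocess.run(cmd, env=env_for("tf1"))`. If the CUDNN libraries for the two toolkits sit in different directories, those would need the same treatment.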

kba commented 3 years ago

Update (h/t @mikegerber): at least TF 2.4+ now requires CUDA 11.

stweil commented 2 years ago

The current Docker image for CUDA is no longer usable with ocrd_all and cannot be rebuilt because NVIDIA .deb links seem to be broken.

OCR-D / core

CUDA version woes #595