Docker: multi-version CUDA

OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules

MIT License

Docker: multi-version CUDA #270

Closed bertsky closed 2 years ago

bertsky commented 3 years ago

Implements #263

bertsky commented 3 years ago

Too bad: this currently yields CUDA_ERROR_SYSTEM_DRIVER_MISMATCH in TensorFlow. I should have checked earlier (in core-cuda)...

bertsky commented 3 years ago

It seems that choosing nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu18.04 as the base image now requires at least nvidia-driver-470 on the host system. I have 440 and 465 on the systems available to me, and neither of them can run the image. That means we are making a sacrifice here: to be able to support the newest TensorFlow/CUDA as well, we are forcing all host systems to get a newer driver. (It just might be that upgrading the driver is easier than upgrading CUDA. But it's still quite inconvenient.)
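To make the driver requirement concrete, here is a minimal sketch of a host-side check. It assumes that nvidia-smi is on the PATH and that comparing the major version against 470 (the threshold reported above) is sufficient. The function names and the constant are hypothetical, not part of ocrd_all.

```python
import subprocess

# Assumption: 470 is the minimum major driver version reported in this thread
# for the nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu18.04 base image.
MIN_DRIVER_MAJOR = 470


def driver_major(version: str) -> int:
    """Return the major component of a driver version string like '470.57.02'."""
    return int(version.strip().split(".")[0])


def driver_is_sufficient(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version is at least the required minimum."""
    return driver_major(version) >= minimum


def query_host_driver() -> str:
    """Ask nvidia-smi for the driver version of the first GPU it lists."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On a host with a GPU, driver_is_sufficient(query_host_driver()) would say whether the image can be expected to start. Hosts with the 440 or 465 drivers mentioned above would fail the check.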

bertsky commented 3 years ago

If you have the NVIDIA package repository configured, you can just upgrade to cuda-drivers-470, which will take care of all dependencies. (But a fresh installation might work, too.)

Anyway, this does work (based on a locally built ocrd/core-cuda from https://github.com/OCR-D/core/pull/704).

for venv in /usr/local/sub-venv/headless-tf*; do . $venv/bin/activate && python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"; done

This prints True three times, once for each venv.
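The same check can be sketched in Python, which makes it easier to collect the per-venv results. This is only an illustration: it assumes that calling each venv's bin/python directly is equivalent to activating the venv for this purpose, and the helper names are hypothetical.

```python
import glob
import os
import subprocess

# The one-liner from the shell loop above, run unchanged in each venv.
GPU_CHECK = "import tensorflow as tf; print(tf.test.is_gpu_available())"


def find_venv_pythons(pattern="/usr/local/sub-venv/headless-tf*"):
    """Return the interpreter path of every venv matching the glob pattern."""
    candidates = (
        os.path.join(venv, "bin", "python") for venv in sorted(glob.glob(pattern))
    )
    return [path for path in candidates if os.path.isfile(path)]


def gpu_visible(python):
    """Run the GPU check with the given interpreter.

    True only if the process exits cleanly and its last stdout line is 'True'.
    """
    result = subprocess.run([python, "-c", GPU_CHECK], capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    return result.returncode == 0 and bool(lines) and lines[-1] == "True"


def check_all(pattern="/usr/local/sub-venv/headless-tf*"):
    """Map each venv interpreter to whether TensorFlow sees a GPU there."""
    return {python: gpu_visible(python) for python in find_venv_pythons(pattern)}
```

Inside the image, check_all() should then map all three interpreters to True if the result above holds.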

bertsky commented 2 years ago

Conflicting files: core

How are you supposed to keep PRs alive which involve subrepos then? I guess I'll have to update https://github.com/OCR-D/core/pull/704 each time core master changes, and then in turn update here.

bertsky commented 2 years ago

So to sum up, we have two drawbacks here:

- the base image size for the -cuda variants becomes even larger (for ocrd/core-cuda it's already 12 GB)
- the host system needs a recent kernel driver to run the images (even for older CUDA)

But

- considering what we gain here,
- and how urgent this is (with https://github.com/bertsky/ocrd_detectron2/issues/7 now even blocking our ocrd/all:maximum-cuda build),
- and that this can probably go away as soon as we build thin images,
- and that non-Docker and non-CUDA Docker setups are not even affected,
- and that it also provides a solution for native installations (i.e. running make cuda-ubuntu or merely make cuda-ldconfig as fixup),
- and that dragging along these PRs with other changes (esp. if you want to combine them with other branches) is a lot of effort,

I'd say let's merge!