NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

question about kernel and DSO version compatibility #1704

Closed rokopi-byte closed 1 year ago

rokopi-byte commented 1 year ago

Hi, I'm trying to use the following image:

nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04

On an Amazon EC2 instance of type p2.xlarge (basically a single Tesla K80 GPU) running Ubuntu 22.04.

I installed the NVIDIA driver using the nvidia-driver-470 package from the distribution repository. I chose version 470 because it's the latest available driver for the K80 according to this. Then I did all the steps in the NVIDIA Container Toolkit documentation and everything was working fine. I then built a container using the image nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04, but I got these errors (dots are mine, I cannot remember the exact numbers):

libcuda reported version is: 470.. kernel reported version is: 515.. kernel version 470... does not match DSO version 515... -- cannot find working devices in this configuration

Then I tried the image nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04 instead and everything works. I suppose my error comes from the fact that CUDA 11.7 is packaged with driver 515, and for this reason the container cannot communicate with the host GPU driver, which is 470 (shouldn't it be backward compatible?), while CUDA 11.4 comes with driver 470, the same major version as the host machine (although the minor version is different). Is this correct? However, I was not able to find this limitation in the documentation; can someone point me in the right direction?

klueska commented 1 year ago

Try setting -e NVIDIA_DISABLE_REQUIRE=true when running your container. This is likely the culprit.
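
For example, adapting the test command from the toolkit docs (a sketch; use whatever image tag you are actually running):

docker run --rm --gpus all \
  -e NVIDIA_DISABLE_REQUIRE=true \
  nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 nvidia-smi

Setting NVIDIA_DISABLE_REQUIRE=true makes the NVIDIA container runtime skip the NVIDIA_REQUIRE_CUDA constraint that the cuda images declare (e.g. a minimum driver version).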

Independent of that, I'm not sure if this is related -- and it's a bit awkward -- but you shouldn't build a container image while the nvidia runtime is set as the default runtime in docker. If you do, you will end up with "ghost" versions of all of your host's driver libraries embedded in the container image (which you don't want).
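
For context, "nvidia set as the default runtime" refers to a daemon configuration along these lines (shown only as an illustration of the setting, not necessarily what your host has):

cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

With that in place, every docker build step also goes through the nvidia runtime, which is how the host's driver libraries can end up baked into the resulting image.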

rokopi-byte commented 1 year ago

I didn't get exactly what you mean in your second point.. isn't that exactly what is done in the documentation?

sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

It's a container build with the nvidia runtime set as the default runtime. I'm just realizing that this works on my system, even though it should not based on my errors, because CUDA 11.6 has driver 510 packaged, which mismatches my 470.

I'm confused. Maybe it's because I'm using docker compose? I have this section in my yml file:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
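
If I understand the docs correctly, that reservations section is roughly the Compose equivalent of the --gpus flag, so outside Compose the same thing would look something like (a sketch, using the image from the docs test):

sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

and the NVIDIA_DISABLE_REQUIRE=true variable suggested above could be passed to the service by adding it under an environment: key in the same yml file.
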
klueska commented 1 year ago

By "build a container", I mean running docker build with nvidia set as the default runtime.

However, it looks like you are just running a container (not building one), so it should be fine.
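
A quick way to check which runtime docker would use by default (and whether nvidia is registered at all) is:

docker info | grep -i runtime

which prints the "Runtimes:" and "Default Runtime:" lines.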

Regarding:

It's a container build with the nvidia runtime set as the default runtime. I'm just realizing that this works on my system, even though it should not based on my errors, because CUDA 11.6 has driver 510 packaged, which mismatches my 470.

None of the driver packages are built into the cuda containers, so I would expect your setup to work just fine (the 470 driver libraries will be injected into the container at runtime, and the 510 cuda libraries should be compatible with that).
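
One way to see what actually gets injected is to list the libcuda files inside a freshly started container (a sketch, using the base image from the docs test):

docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 \
  sh -c 'ls -l /usr/lib/x86_64-linux-gnu/libcuda*'

On a clean image you should only see a library matching the host driver version (470.x in your case), plus the libcuda.so.1 symlink pointing at it.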

rokopi-byte commented 1 year ago

Hi, I started from scratch with a fresh Ubuntu 22.04 installation. Installed the nvidia-driver-470 driver, ran nvidia-smi, GPU recognized. Installed Docker. Then I used the official documentation to install nvidia-container-runtime, ran the test on that page, and everything is working. This is the output of docker info:


Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.12.2)
  scan: Docker Scan (Docker Inc., v0.21.0)

Server:
 Containers: 6
  Running: 0
  Paused: 0
  Stopped: 6
 Images: 7
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-1023-aws
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 59.85GiB
 Name: ip-172-31-4-132
 ID: MIKC:JDK6:BCHB:AUH7:BPT3:RIE7:KLS6:Z75F:WYHP:QKUU:JM34:2EVO
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

The nvidia runtime is not even present (on my previous attempts it was present, but the default runtime was runc).
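
(As far as I understand, --gpus works through the toolkit's prestart hook even when no nvidia runtime is registered with docker, which would explain why the test still works. To check that the toolkit itself sees the host driver, something like this on the host should report the 470 driver and the K80:)

nvidia-container-cli info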

Running nvidia-smi inside the container I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P0    72W / 149W |      0MiB / 11441MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Anyway, I get the same error from TensorFlow:

kernel version 470.141.3 does not match DSO version 515.65.7
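
(For reference, this is the diagnostic TensorFlow prints when it fails to initialize CUDA; it shows up, for example, when listing GPUs with something like:)

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"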

Using -e NVIDIA_DISABLE_REQUIRE=true does not help. What I noticed is that inside the container I have:

/usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.515.65.07

With the following symlinks: libcuda.so -> libcuda.so.1 -> libcuda.so.515.65.07. Is this correct?
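
One way to check which library the loader actually resolves, and to compare it against the kernel driver, is to run inside the container:

readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so.1
nvidia-smi --query-gpu=driver_version --format=csv,noheader

If libcuda.so.1 resolves to the 515.65.07 file while the driver reported by nvidia-smi is 470.141.03, that is exactly the kernel/DSO mismatch TensorFlow complains about; normally the symlink should end up pointing at the host's 470 library injected by the toolkit.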

rokopi-byte commented 1 year ago

I think this could be the reason (libcuda.so.1 resolving to the 515 library while the host driver is 470).. any idea?