Try setting `-e NVIDIA_DISABLE_REQUIRE=true` when running your container. This is likely the culprit.
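For example (a sketch of the full command, using the image tag from your report):

```sh
# Sketch: relax the CUDA-version requirement check at container start.
docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=true \
    nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 nvidia-smi
```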
Independent of that (I'm not sure if it's related, and it's a bit awkward): you can't build a container image while the nvidia runtime is set as the default runtime in docker. If you do, you will end up with "ghost" versions of all of your host's driver libraries embedded in the container image (which you don't want).
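You can check which runtime the daemon defaults to with:

```sh
# Show the daemon's default runtime; for image builds this should be "runc".
docker info --format '{{.DefaultRuntime}}'
# The default runtime is usually set via "default-runtime" in /etc/docker/daemon.json.
```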
I didn't quite get what you mean in your second point. Isn't that exactly what is done in the documentation?
```sh
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```
It's a container built with the nvidia runtime set as the default runtime. I'm just realizing that this works on my system even though, based on my errors, it should not: cuda 11.6 ships driver 510 packages, which mismatch my 470 driver.
I'm confused. Maybe it's because I'm using docker compose? I have this section in my yml file:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
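As far as I understand, that section should be roughly equivalent to the `--gpus` flag on plain `docker run` (my reading of the compose docs, not verified):

```sh
# Rough docker run equivalent of the compose device reservation above
# (<image> is a placeholder):
docker run --rm --gpus 1 <image>
```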
By building a container, I mean running `docker build` with `nvidia` set as the default runtime. However, it looks like you are just running a container (not building one), so it should be fine.
Regarding:

> It's a container built with the nvidia runtime set as the default runtime. I'm just realizing that this works on my system even though, based on my errors, it should not: cuda 11.6 ships driver 510 packages, which mismatch my 470 driver.

None of the driver packages are built into the `cuda` containers, so I would expect your setup to work just fine (the 470 driver libraries will be injected into the container at runtime, and the 510 cuda libraries should be compatible with that).
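A quick way to confirm what gets injected (a sketch, reusing the image from the documented test):

```sh
# Sketch: list the libcuda libraries visible inside a freshly started container.
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 \
    sh -c 'ls -l /usr/lib/x86_64-linux-gnu/libcuda.so*'
# With a 470 host driver, a libcuda.so.470.x injected at runtime is expected here.
```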
Hi, I started from scratch with a fresh Ubuntu 22.04 installation. I installed the `nvidia-driver-470` driver, ran `nvidia-smi`, and the GPU was recognized. Then I installed Docker.
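Concretely, the driver install was roughly these commands (from the distribution repository):

```sh
# Driver install from the Ubuntu 22.04 distribution repository:
sudo apt-get update
sudo apt-get install -y nvidia-driver-470
nvidia-smi   # the Tesla K80 shows up with driver 470.x
```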
Then I followed the official Docker documentation to install `nvidia-container-runtime`, ran the test on that page, and everything worked.
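For completeness, these are roughly the commands from that page (the package name is what the docs listed; the daemon restart is from memory):

```sh
# Assumed steps from the Docker docs: install the runtime, restart the daemon.
sudo apt-get install -y nvidia-container-runtime
sudo systemctl restart docker
# Documented smoke test:
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```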
This is the output of `docker info`:
```
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.12.2)
  scan: Docker Scan (Docker Inc., v0.21.0)

Server:
 Containers: 6
  Running: 0
  Paused: 0
  Stopped: 6
 Images: 7
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-1023-aws
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 59.85GiB
 Name: ip-172-31-4-132
 ID: MIKC:JDK6:BCHB:AUH7:BPT3:RIE7:KLS6:Z75F:WYHP:QKUU:JM34:2EVO
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```
The nvidia runtime is not even present (on my previous attempts it was present, but the default runtime was runc).
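If I ever need to add it back, I understand newer toolkit versions ship an `nvidia-ctk` helper for this; a sketch, assuming that binary is installed:

```sh
# Sketch, assuming nvidia-ctk is installed: register the nvidia runtime
# in /etc/docker/daemon.json (without making it the default) and reload docker.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```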
Running `nvidia-smi` inside the container I get:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P0    72W / 149W |      0MiB / 11441MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Anyway, I get the same result from TensorFlow:

```
kernel version 470.141.3 does not match DSO version 515.65.7
```
Using `-e NVIDIA_DISABLE_REQUIRE=true` does not help. What I noticed is that inside the container I have:

```
/usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.515.65.07
```

with the following symlinks: `libcuda.so -> libcuda.so.1 -> libcuda.so.515.65.07`. Is this correct? I think this could be the reason. Any idea?
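To double-check which library is actually resolved, something like this should work (a sketch):

```sh
# Sketch: resolve the libcuda.so symlink chain inside the container.
docker run --rm --gpus all nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 \
    sh -c 'readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so'
# If this prints libcuda.so.515.65.07 while the host kernel module is 470,
# the DSO/kernel mismatch above is expected.
```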
Hi, I'm trying to use the image `nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04` on an Amazon EC2 instance of type p2.xlarge (basically a single Tesla K80 GPU) running Ubuntu 22.04. I installed the NVIDIA driver using the package `nvidia-driver-470` from the distribution repository. I chose version 470 because it's the latest available driver for the K80 according to this. Then I did all the steps mentioned in the NVIDIA container toolkit documentation and everything was working fine. I then built a container using the image `nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04`, but I got these errors (dots are mine, I cannot remember the exact numbers):

```
libcuda reported version is: 470..
kernel reported version is: 515..
kernel version 470... does not match DSO version 515...
cannot find working devices in this configuration
```
Then I tried the image `nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04` instead, and everything works. I suppose my error comes from the fact that cuda 11.7 has driver 515 packaged, and for this reason the container cannot communicate with the host GPU driver, which is 470 (shouldn't it be backward compatible?). CUDA 11.4, on the other hand, comes with driver 470, the same major version as the host machine's (although the minor version is different). Is this correct? However, I was not able to find this limitation in the documentation; can someone point me in the right direction?
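For reference, my understanding is that the cuda images declare the driver they need through an `NVIDIA_REQUIRE_CUDA` environment variable, which the runtime checks at startup; a sketch of how to compare it against the host driver (the grep/format plumbing is my own):

```sh
# Sketch: show the image's declared CUDA/driver constraint...
docker image inspect nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 \
    --format '{{json .Config.Env}}' | tr ',' '\n' | grep NVIDIA_REQUIRE_CUDA
# ...and the host driver version, for comparison:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```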