Golchoubian opened this issue 2 years ago
@Golchoubian could you enable debug logging in the NVIDIA Container CLI by uncommenting the line
#debug = "/var/log/nvidia-container-toolkit.log"
in /etc/nvidia-container-runtime/config.toml, repeating the failed run, and then attaching /var/log/nvidia-container-toolkit.log here? From the error message it would seem that the ldcache is not being updated correctly in the container.
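If it's useful, here is a minimal sketch of those steps (assuming the default config location above; the sed invocation is just one way to uncomment the line):
sudo sed -i 's|^#debug = "/var/log/nvidia-container-toolkit.log"|debug = "/var/log/nvidia-container-toolkit.log"|' /etc/nvidia-container-runtime/config.toml
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi   # repeat the failing run
sudo tail -n 50 /var/log/nvidia-container-toolkit.log   # attach this output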
We could also run the following to confirm:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "ldconfig; nvidia-smi"
Could you provide more information on your host system and whether something might be preventing /sbin/ldconfig
from being run by the NVIDIA Container Runtime Hook?
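For example, something like the following could help rule that out (paths assumed from a standard Ubuntu install, where /sbin/ldconfig is a wrapper script and /sbin/ldconfig.real is the actual binary):
ls -l /sbin/ldconfig /sbin/ldconfig.real   # both should exist and be executable
ldconfig -p | grep libnvidia-ml   # the driver library should appear in the host ldcache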
@elezar Despite uncommenting the debug line you mentioned, no nvidia-container-toolkit.log was created by the failed run unless I again ran the command with sudo, which produced the attached file: nvidia-container-toolkit.log
When I run the second command that you shared, I get the same error:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "ldconfig; nvidia-smi"
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
Here is more information on my system:
Linux mahsa 5.13.0-48-generic #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
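For reference, a short sketch of standard commands for collecting this kind of host detail (nvidia-container-cli ships with the toolkit):
uname -a   # kernel and architecture, as pasted above
nvidia-smi   # confirm the driver works on the host itself
nvidia-container-cli -k -d /dev/tty info   # the toolkit's verbose view of the driver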
@elezar
I don't quite have the full story, as I'm unfamiliar with the details of the NVIDIA container runtime's architecture, but I was seeing the same behavior (could run nvidia-smi in containers with sudo but not without, with the same load library failed: libnvidia-ml.so.1 error) and can add the following, which applied at least in my case.
Importantly, I had installed Docker Desktop (on Ubuntu 22.04), which set the docker CLI's current context to use unix:///home/$USER/.docker/desktop/docker.sock as the docker endpoint:
❯ docker context inspect $(docker context show)
[
    {
        "Name": "desktop-linux",
        "Metadata": {},
        "Endpoints": {
            "docker": {
                "Host": "unix:///home/$USER/.docker/desktop/docker.sock",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/home/$USER/.docker/contexts/meta/fe9c6bd7a66301f49ca9b6a70b217107cd1284598bfc254700c989b916da791e",
            "TLSPath": "/home/$USER/.docker/contexts/tls/fe9c6bd7a66301f49ca9b6a70b217107cd1284598bfc254700c989b916da791e"
        }
    }
]
After switching back to the default context:
docker context use default
the docker endpoint was set back to unix:///var/run/docker.sock, and I was able to run
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
as expected.
So it seems that the Docker Desktop docker host is somehow interfering here, at least in my case.
Happy to provide more information here in case it helps.
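As a quick check for anyone who suspects the same cause, the active context and its endpoint can be listed directly (the asterisk marks the context in use):
docker context ls   # the active context is starred; its socket appears in the DOCKER ENDPOINT column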
I installed nvidia-docker2 following the instructions. When running the following command, I get the expected output as shown.
However, running the above command without sudo results in the following error for me:
Here is some additional information regarding my issue:
Can you please instruct me on how to solve this?