NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

sudo yum install -y nvidia-container-toolkit failed - No such device #145

Open howtoadd opened 1 year ago

howtoadd commented 1 year ago

I am using an AWS EC2 instance (Tesla T4), and I believe the NVIDIA driver is installed by default.

Running nvidia-smi gives the expected output:

Thu Nov 9 07:36:11 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

System Info

cat /etc/os-release

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

But when I install nvidia-container-toolkit, I get errors.

I installed nvidia-container-toolkit following the guide.

Step 1 (success): curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

Output:

[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Step 2 (failed): sudo yum install -y nvidia-container-toolkit

Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, versionlock
neuron | 2.9 kB 00:00:00
nvidia-container-toolkit/x86_64/signature | 833 B 00:00:00
Retrieving key from https://nvidia.github.io/libnvidia-container/gpgkey
nvidia-container-toolkit/x86_64/signature | 2.1 kB 00:00:02 !!!
Traceback (most recent call last):
  File "/bin/yum", line 29, in <module>
    yummain.user_main(sys.argv[1:], exit_code=True)
  File "/usr/share/yum-cli/yummain.py", line 375, in user_main
    errcode = main(args)
  File "/usr/share/yum-cli/yummain.py", line 184, in main
    result, resultmsgs = base.doCommands()
  File "/usr/share/yum-cli/cli.py", line 584, in doCommands
    return self.yum_cli_commands[self.basecmd].doCommand(self, self.basecmd, self.extcmds)
  File "/usr/share/yum-cli/yumcommands.py", line 446, in doCommand
    return base.installPkgs(extcmds, basecmd=basecmd)
  File "/usr/share/yum-cli/cli.py", line 1016, in installPkgs
    txmbrs = self.install(pattern=arg)
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 4827, in install
    mypkgs = self.pkgSack.returnPackages(patterns=pats,
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 1074, in <lambda>
    pkgSack = property(fget=lambda self: self._getSacks(),
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 778, in _getSacks
    self.repos.populateSack(which=repos)
  File "/usr/lib/python2.7/site-packages/yum/repos.py", line 347, in populateSack
    self.doSetup()
  File "/usr/lib/python2.7/site-packages/yum/repos.py", line 157, in doSetup
    self.retrieveAllMD()
  File "/usr/lib/python2.7/site-packages/yum/repos.py", line 88, in retrieveAllMD
    dl = repo._async and repo._commonLoadRepoXML(repo)
  File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1553, in _commonLoadRepoXML
    result = self._getFileRepoXML(local, text)
  File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1330, in _getFileRepoXML
    size=102400) # setting max size as 100K
  File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1105, in _getFile
    kwargs
  File "/usr/lib/python2.7/site-packages/urlgrabber/mirror.py", line 448, in urlgrab
    return self._mirror_try(func, url, kw)
  File "/usr/lib/python2.7/site-packages/urlgrabber/mirror.py", line 425, in _mirror_try
    return func_ref( *(fullurl,), opts=opts, *kw )
  File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1216, in urlgrab
    return self._retry(opts, retryfunc, url, filename)
  File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1105, in _retry
    r = apply(func, (opts,) + args, {})
  File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1210, in retryfunc
    _run_callback(opts.checkfunc, obj)
  File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1073, in _run_callback
    return cb(obj, arg, karg)
  File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1802, in _checkRepoXML
    self.gpg_import_func(self, self.confirm_func)
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 6420, in getKeyForRepo
    self._getAnyKeyForRepo(repo, repo.gpgdir, repo.gpgkey, is_cakey=False, callback=callback)
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 6339, in _getAnyKeyForRepo
    if hex(int(info['keyid']))[2:-1].upper() in misc.return_keyids_from_pubring(destdir):
  File "/usr/lib/python2.7/site-packages/yum/misc.py", line 623, in return_keyids_from_pubring
    for k in ctx.keylist():
gpgme.GpgmeError: (7, 32848, u'No such device')

howtoadd commented 1 year ago

I suspected a problem with GPG, so I tried 【sudo yum install --nogpgcheck -y nvidia-container-toolkit】 to skip the GPG check. That gets past the key error, but the install still does not work on amzn2:

sudo yum install --nogpgcheck -y nvidia-container-toolkit

Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, versionlock
amzn2-core | 3.6 kB 00:00:00
amzn2-nvidia | 2.6 kB 00:00:00
neuron | 2.9 kB 00:00:00
nvidia-container-toolkit | 2.1 kB 00:00:00
nvidia-container-toolkit/x86_64/primary | 5.8 kB 00:00:00
nvidia-container-toolkit 30/30
15 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2 will be obsoleted
---> Package nvidia-container-toolkit.x86_64 0:1.14.3-1 will be obsoleting
--> Processing Dependency: nvidia-container-toolkit-base = 1.14.3-1 for package: nvidia-container-toolkit-1.14.3-1.x86_64
--> Processing Dependency: libnvidia-container-tools >= 1.14.3-1 for package: nvidia-container-toolkit-1.14.3-1.x86_64
--> Running transaction check
---> Package nvidia-container-toolkit.x86_64 0:1.14.3-1 will be obsoleting
--> Processing Dependency: libnvidia-container-tools >= 1.14.3-1 for package: nvidia-container-toolkit-1.14.3-1.x86_64
---> Package nvidia-container-toolkit-base.x86_64 0:1.14.3-1 will be obsoleting
--> Finished Dependency Resolution
Error: Package: nvidia-container-toolkit-1.14.3-1.x86_64 (nvidia-container-toolkit)
           Requires: libnvidia-container-tools >= 1.14.3-1
           Installed: libnvidia-container-tools-1.4.0-1.amzn2.x86_64 (@amzn2-nvidia)
               libnvidia-container-tools = 1.4.0-1.amzn2
           Available: libnvidia-container-tools-1.0.0-2.amzn2.x86_64 (amzn2-nvidia)
               libnvidia-container-tools = 1.0.0-2.amzn2
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
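
The dependency error above comes down to a version comparison: the new toolkit requires libnvidia-container-tools >= 1.14.3-1, while the installed package is 1.4.0-1.amzn2. A minimal sketch of that comparison follows, assuming GNU `sort -V` is available. The helper name `version_ge` is hypothetical, and `sort -V` only approximates the ordering rpm itself uses.

```shell
#!/bin/sh
# version_ge HAVE WANT: succeed when HAVE is at least WANT.
# Hypothetical helper; sort -V compares dotted versions field by field,
# which approximates (but is not identical to) rpm's own ordering.
version_ge() {
    have="$1"
    want="$2"
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Versions taken from the yum error: 1.4.0 is installed, 1.14.3 is required.
for installed in 1.4.0 1.14.3; do
    if version_ge "$installed" 1.14.3; then
        echo "$installed satisfies >= 1.14.3"
    else
        echo "$installed does not satisfy >= 1.14.3"
    fi
done
```

Read as a decimal, 1.4.0 can look newer than 1.14.3, but the second field is compared as a number (4 versus 14), so the installed package is the older one.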

elezar commented 1 year ago

I have just done the following:

$  docker run --rm -ti amazonlinux:2
bash-4.2# curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
>   sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
bash: sudo: command not found
bash-4.2# curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
bash-4.2# yum install -y nvidia-container-toolkit
Loaded plugins: ovl, priorities
amzn2-core                                                                                                                                                                                                                         | 3.6 kB  00:00:00
nvidia-container-toolkit/x86_64/signature                                                                                                                                                                                          |  833 B  00:00:00
Retrieving key from https://nvidia.github.io/libnvidia-container/gpgkey
Importing GPG key 0xF796ECB0:
 Userid     : "NVIDIA CORPORATION (Open Source Projects) <cudatools@nvidia.com>"
 Fingerprint: c95b 321b 61e8 8c18 09c4 f759 ddca e044 f796 ecb0
 From       : https://nvidia.github.io/libnvidia-container/gpgkey
nvidia-container-toolkit/x86_64/signature                                                                                                                                                                                          | 2.1 kB  00:00:01 !!!
(1/4): amzn2-core/2/x86_64/group_gz                                                                                                                                                                                                | 2.7 kB  00:00:00
(2/4): amzn2-core/2/x86_64/updateinfo                                                                                                                                                                                              | 737 kB  00:00:00
(3/4): nvidia-container-toolkit/x86_64/primary                                                                                                                                                                                     | 5.8 kB  00:00:00
(4/4): amzn2-core/2/x86_64/primary_db                                                                                                                                                                                              |  67 MB  00:00:05
nvidia-container-toolkit                                                                                                                                                                                                                            30/30
Resolving Dependencies
--> Running transaction check
...

Installed:
  nvidia-container-toolkit.x86_64 0:1.14.3-1

Dependency Installed:
  libnvidia-container-tools.x86_64 0:1.14.3-1                      libnvidia-container1.x86_64 0:1.14.3-1                      libseccomp.x86_64 0:2.4.1-1.amzn2                      nvidia-container-toolkit-base.x86_64 0:1.14.3-1

Complete!

Could you please confirm what differences there are between your AMI and the amazonlinux:2 image, specifically in terms of yum version and GPG libraries?
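
One way to gather the information requested above is to print the same few versions on both systems and compare them. This is only a sketch: the function names `report_tool` and `report_environment` are hypothetical, and the package names `pygpgme` and `gpgme` are an assumption based on the `gpgme.GpgmeError` in the traceback.

```shell
#!/bin/sh
# report_tool NAME ARGS...: print the first line of "NAME ARGS" output,
# or "not installed" when NAME is not on PATH. Hypothetical helper.
report_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        printf '%s: %s\n' "$tool" "$("$tool" "$@" 2>&1 | head -n 1)"
    else
        printf '%s: not installed\n' "$tool"
    fi
}

# report_environment: collect yum and GPG versions for comparison.
report_environment() {
    report_tool yum --version
    report_tool gpg --version
    report_tool gpg2 --version
    # Assumed package names for the GPG bindings used by yum.
    if command -v rpm >/dev/null 2>&1; then
        rpm -q yum pygpgme gpgme 2>&1
    fi
    return 0
}
```

Running `report_environment` on the EC2 instance and inside the amazonlinux:2 container, then diffing the two outputs, should show whether the yum or GPG pieces differ.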

howtoadd commented 1 year ago

There were conflicts with the packages shipped with amzn2. I removed the older version of libnvidia-container and installed libnvidia-container-tools-1.14.3-1; after that, yum install nvidia-container-toolkit succeeded.

Running 【sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi】 now gives the expected output.

I then ran 【sudo nvidia-ctk runtime configure --runtime=containerd】 and 【sudo systemctl restart containerd】.

After that, I get an error when I start my pod on Amazon EKS. Shouldn't 【/etc/docker-runtimes.d/nvidia】 be installed by 【yum install -y nvidia-container-toolkit】?

Warning FailedCreatePodSandBox 66s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/5065fbc879a35b5bb452fac30f22ec0a7e26288bcee135009f17c642ccb62b64/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 52s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/6213590deca8446895cba9aa7125576f7576100a2d2e6858daaf928e0d5a3bd9/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 39s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/3dcd809e48773bcaa3ffab9a8e53baa8ac27cf2f3546213129cdf9688da3e58d/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 26s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/0d1555ee43abe7096fb634bc8478601d0439ab93a15c3a4cc87aa302e2c23443/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 14s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/580f049163000809cf4960fbac86434264016e12701e32a87c0ab4f4072faede/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedToRetrieveImagePullSecret 3s (x6 over 66s) kubelet Unable to retrieve some image pull secrets (anquankeji); attempting to pull the image may not succeed.
Warning FailedCreatePodSandBox 3s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/4a89d08029ea2c7b6918d50798e1ecefda5f63db7897c2c4417b70834d8de5f6/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown

elezar commented 1 year ago

Here it seems as if containerd cannot find the nvidia runtime as configured. Could you provide the containerd config after the modifications to add the nvidia runtime were applied?
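
Before posting the whole config, a quick check is to list every runtime binary the containerd config points at and see whether each one exists. The sketch below assumes the paths appear as quoted values of a `BinaryName` or `runtime` key; the function name `list_runtime_binaries` is hypothetical, and the config path in the usage note is the usual default, which may differ on an EKS node.

```shell
#!/bin/sh
# list_runtime_binaries CONFIG: print each quoted absolute path assigned to a
# BinaryName or runtime key in CONFIG, followed by "ok" if it is an executable
# file and "MISSING" otherwise. Hypothetical helper; the key names are assumed.
list_runtime_binaries() {
    config="$1"
    sed -nE 's#^[[:space:]]*(BinaryName|runtime)[[:space:]]*=[[:space:]]*"(/[^"]*)".*#\2#p' "$config" |
    while IFS= read -r path; do
        if [ -x "$path" ]; then
            echo "$path ok"
        else
            echo "$path MISSING"
        fi
    done
}
```

For example, `list_runtime_binaries /etc/containerd/config.toml` would report `/etc/docker-runtimes.d/nvidia MISSING` if the config still references the path from the kubelet events and that file is gone.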

killmepete commented 12 months ago

Hi,

I ran into this issue. The fix for me was to run:

sudo yum-config-manager --disable amzn2-graphics

If you install the container toolkit following the NVIDIA instructions after doing this you should be in a better position.

Managed to find this here...

https://github.com/NVIDIA/nvidia-docker/issues/1310

mrEuler commented 10 months ago

@killmepete thanks, that helped me instantly!

NayamAmarshe commented 2 months ago

I fixed this with:

sudo yum-config-manager --disable amzn2-graphics

sudo yum erase -y libnvidia-container

sudo yum install -y nvidia-container-toolkit

sudo yum-config-manager --enable amzn2-graphics

sudo yum install -y docker-runtime-nvidia

The erase step removed the Amazon-provided libnvidia-container package together with the packages that depend on it:

$ sudo yum erase -y libnvidia-container
Failed to set locale, defaulting to C
Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, upgrade-helper, versionlock
Resolving Dependencies
--> Running transaction check
---> Package libnvidia-container.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: libnvidia-container.so.1()(64bit) for package: libnvidia-container-tools-1.4.0-1.amzn2.x86_64
--> Processing Dependency: libnvidia-container.so.1(NVC_1.0)(64bit) for package: libnvidia-container-tools-1.4.0-1.amzn2.x86_64
--> Running transaction check
---> Package libnvidia-container-tools.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: libnvidia-container-tools for package: nvidia-container-runtime-hook-1.4.0-1.amzn2.x86_64
--> Running transaction check
---> Package nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: nvidia-container-runtime-hook for package: docker-runtime-nvidia-1-2.amzn2.noarch
--> Running transaction check
---> Package docker-runtime-nvidia.noarch 0:1-2.amzn2 will be erased
--> Finished Dependency Resolution

Dependencies Resolved

=======================================================================================================================================================
 Package                                          Arch                      Version                             Repository                        Size
=======================================================================================================================================================
Removing:
 libnvidia-container                              x86_64                    1.4.0-1.amzn2                       @amzn2-nvidia                    243 k
Removing for dependencies:
 docker-runtime-nvidia                            noarch                    1-2.amzn2                           @amzn2-nvidia                    435  
 libnvidia-container-tools                        x86_64                    1.4.0-1.amzn2                       @amzn2-nvidia                    113 k
 nvidia-container-runtime-hook                    x86_64                    1.4.0-1.amzn2                       @amzn2-nvidia                    1.8 M

Transaction Summary
=======================================================================================================================================================
Remove  1 Package (+3 Dependent packages)

Installed size: 2.2 M
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Erasing    : docker-runtime-nvidia-1-2.amzn2.noarch                                                                                              1/4 
  Erasing    : nvidia-container-runtime-hook-1.4.0-1.amzn2.x86_64                                                                                  2/4 
  Erasing    : libnvidia-container-tools-1.4.0-1.amzn2.x86_64                                                                                      3/4 
  Erasing    : libnvidia-container-1.4.0-1.amzn2.x86_64                                                                                            4/4 
  Verifying  : libnvidia-container-1.4.0-1.amzn2.x86_64                                                                                            1/4 
  Verifying  : libnvidia-container-tools-1.4.0-1.amzn2.x86_64                                                                                      2/4 
  Verifying  : nvidia-container-runtime-hook-1.4.0-1.amzn2.x86_64                                                                                  3/4 
  Verifying  : docker-runtime-nvidia-1-2.amzn2.noarch                                                                                              4/4 

Removed:
  libnvidia-container.x86_64 0:1.4.0-1.amzn2                                                                                                           

Dependency Removed:
  docker-runtime-nvidia.noarch 0:1-2.amzn2   libnvidia-container-tools.x86_64 0:1.4.0-1.amzn2   nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2  

Complete!
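
The five commands from this comment depend on their order, so it can help to run them as one sequence that stops at the first failure. The sketch below wraps them that way; `run_step` and `replace_toolkit` are hypothetical names, and by default the script only prints the commands. Setting `DRY_RUN=0` is assumed to be an explicit opt-in to actually executing them.

```shell
#!/bin/sh
# run_step CMD...: echo the command, and execute it only when DRY_RUN=0.
# Hypothetical helper; the default is a dry run that changes nothing.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

# replace_toolkit: the sequence from the comment, stopping at the first failure.
replace_toolkit() {
    run_step sudo yum-config-manager --disable amzn2-graphics &&
    run_step sudo yum erase -y libnvidia-container &&
    run_step sudo yum install -y nvidia-container-toolkit &&
    run_step sudo yum-config-manager --enable amzn2-graphics &&
    run_step sudo yum install -y docker-runtime-nvidia
}
```

Calling `replace_toolkit` prints the plan, and `DRY_RUN=0 replace_toolkit` runs it. Because erasing libnvidia-container also removes docker-runtime-nvidia, as the output above shows, the last step reinstalls it.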