Open howtoadd opened 1 year ago
I suspect there is an issue with GPG, so I tried 【sudo yum install --nogpgcheck -y nvidia-container-toolkit】 to skip the GPG check. It still doesn't work on amzn2:
sudo yum install --nogpgcheck -y nvidia-container-toolkit
Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, versionlock
amzn2-core | 3.6 kB 00:00:00
amzn2-nvidia | 2.6 kB 00:00:00
neuron | 2.9 kB 00:00:00
nvidia-container-toolkit | 2.1 kB 00:00:00
nvidia-container-toolkit/x86_64/primary | 5.8 kB 00:00:00
nvidia-container-toolkit 30/30
15 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2 will be obsoleted
---> Package nvidia-container-toolkit.x86_64 0:1.14.3-1 will be obsoleting
--> Processing Dependency: nvidia-container-toolkit-base = 1.14.3-1 for package: nvidia-container-toolkit-1.14.3-1.x86_64
--> Processing Dependency: libnvidia-container-tools >= 1.14.3-1 for package: nvidia-container-toolkit-1.14.3-1.x86_64
--> Running transaction check
---> Package nvidia-container-toolkit.x86_64 0:1.14.3-1 will be obsoleting
--> Processing Dependency: libnvidia-container-tools >= 1.14.3-1 for package: nvidia-container-toolkit-1.14.3-1.x86_64
---> Package nvidia-container-toolkit-base.x86_64 0:1.14.3-1 will be obsoleting
--> Finished Dependency Resolution
Error: Package: nvidia-container-toolkit-1.14.3-1.x86_64 (nvidia-container-toolkit)
Requires: libnvidia-container-tools >= 1.14.3-1
Installed: libnvidia-container-tools-1.4.0-1.amzn2.x86_64 (@amzn2-nvidia)
libnvidia-container-tools = 1.4.0-1.amzn2
Available: libnvidia-container-tools-1.0.0-2.amzn2.x86_64 (amzn2-nvidia)
libnvidia-container-tools = 1.0.0-2.amzn2
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
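The resolver error above comes down to a version comparison: the installed libnvidia-container-tools is 1.4.0, the toolkit needs at least 1.14.3, and the only other candidate yum lists is 1.0.0 from amzn2-nvidia. A minimal sketch of that comparison, assuming GNU `sort -V` is available (its ordering approximates RPM's version comparison but is not identical to it; `version_ge` is a made-up helper name):

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Uses GNU "sort -V" ordering, which is an approximation of RPM's rules.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

installed="1.4.0"    # libnvidia-container-tools from @amzn2-nvidia
required="1.14.3"    # needed by nvidia-container-toolkit-1.14.3-1

if version_ge "$installed" "$required"; then
    echo "installed $installed satisfies >= $required"
else
    echo "installed $installed is older than required $required"
fi
```

On the affected host, 【yum --showduplicates list libnvidia-container-tools】 should show which versions each enabled repo offers. The "15 packages excluded due to repository priority protections" line suggests the priorities plugin may be hiding the newer build from the NVIDIA repo.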
I have just done the following:
$ docker run --rm -ti amazonlinux:2
bash-4.2# curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
> sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
bash: sudo: command not found
bash-4.2# curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
bash-4.2# yum install -y nvidia-container-toolkit
Loaded plugins: ovl, priorities
amzn2-core | 3.6 kB 00:00:00
nvidia-container-toolkit/x86_64/signature | 833 B 00:00:00
Retrieving key from https://nvidia.github.io/libnvidia-container/gpgkey
Importing GPG key 0xF796ECB0:
Userid : "NVIDIA CORPORATION (Open Source Projects) <cudatools@nvidia.com>"
Fingerprint: c95b 321b 61e8 8c18 09c4 f759 ddca e044 f796 ecb0
From : https://nvidia.github.io/libnvidia-container/gpgkey
nvidia-container-toolkit/x86_64/signature | 2.1 kB 00:00:01 !!!
(1/4): amzn2-core/2/x86_64/group_gz | 2.7 kB 00:00:00
(2/4): amzn2-core/2/x86_64/updateinfo | 737 kB 00:00:00
(3/4): nvidia-container-toolkit/x86_64/primary | 5.8 kB 00:00:00
(4/4): amzn2-core/2/x86_64/primary_db | 67 MB 00:00:05
nvidia-container-toolkit 30/30
Resolving Dependencies
--> Running transaction check
...
Installed:
nvidia-container-toolkit.x86_64 0:1.14.3-1
Dependency Installed:
libnvidia-container-tools.x86_64 0:1.14.3-1 libnvidia-container1.x86_64 0:1.14.3-1 libseccomp.x86_64 0:2.4.1-1.amzn2 nvidia-container-toolkit-base.x86_64 0:1.14.3-1
Complete!
Could you please confirm what differences there are between your AMI and the amazonlinux:2
images -- specifically in terms of yum version and gpg libraries.
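When comparing two hosts it can help to confirm that both use the same repo definition. A minimal sketch that summarizes the signature-related settings of a yum .repo file, one line per section (`repo_summary` is a made-up helper name, and the sample input is trimmed from the repo file printed above):

```shell
#!/usr/bin/env bash
# Reads a yum .repo file on stdin and prints, for each section:
#   <section> <repo_gpgcheck> <gpgcheck> <enabled>
# A "-" means the key was not set in that section.
repo_summary() {
    awk -F= '
        /^\[.*\]$/ {
            if (name != "") print name, r, g, e
            name = substr($0, 2, length($0) - 2)
            r = g = e = "-"
            next
        }
        $1 == "repo_gpgcheck" { r = $2 }
        $1 == "gpgcheck"      { g = $2 }
        $1 == "enabled"       { e = $2 }
        END { if (name != "") print name, r, g, e }
    '
}

repo_summary <<'EOF'
[nvidia-container-toolkit]
name=nvidia-container-toolkit
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
repo_gpgcheck=1
gpgcheck=0
enabled=0
EOF
```

On a real host the input would be 【repo_summary < /etc/yum.repos.d/nvidia-container-toolkit.repo】. The package versions asked about can be listed with 【rpm -q yum pygpgme gpgme gnupg2】, assuming those are the package names on the AMI.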
There are conflicts with the amzn2 packages. I removed the older version of libnvidia-container, installed libnvidia-container-tools-1.14.3-1, and then yum install nvidia-container-toolkit succeeded.
I get the expected output when running 【sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi】.
After running 【sudo nvidia-ctk runtime configure --runtime=containerd】 and 【sudo systemctl restart containerd】, the kubelet reports:
Warning FailedCreatePodSandBox 66s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/5065fbc879a35b5bb452fac30f22ec0a7e26288bcee135009f17c642ccb62b64/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 52s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/6213590deca8446895cba9aa7125576f7576100a2d2e6858daaf928e0d5a3bd9/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 39s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/3dcd809e48773bcaa3ffab9a8e53baa8ac27cf2f3546213129cdf9688da3e58d/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 26s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/0d1555ee43abe7096fb634bc8478601d0439ab93a15c3a4cc87aa302e2c23443/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedCreatePodSandBox 14s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/580f049163000809cf4960fbac86434264016e12701e32a87c0ab4f4072faede/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Warning FailedToRetrieveImagePullSecret 3s (x6 over 66s) kubelet Unable to retrieve some image pull secrets (anquankeji); attempting to pull the image may not succeed.
Warning FailedCreatePodSandBox 3s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/4a89d08029ea2c7b6918d50798e1ecefda5f63db7897c2c4417b70834d8de5f6/log.json: no such file or directory): fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory: unknown
Here it seems as if containerd cannot find the nvidia runtime as configured. Could you provide the containerd config after the modifications to add the nvidia runtime were applied?
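The events end in "fork/exec /etc/docker-runtimes.d/nvidia: no such file or directory", which suggests containerd is pointed at a runtime binary that is not on disk. A minimal sketch for spotting that in a containerd config, assuming the runtime is set through `BinaryName` or `runtime` keys with absolute paths (`missing_runtime_paths` is a made-up helper name, and the sample config lines are illustrative only):

```shell
#!/usr/bin/env bash
# Reads a containerd config (TOML) on stdin and prints every absolute path
# assigned to a BinaryName or runtime key that is not an executable file.
missing_runtime_paths() {
    grep -E '^[[:space:]]*(BinaryName|runtime)[[:space:]]*=[[:space:]]*"/' |
        sed -E 's/^[^"]*"([^"]*)".*$/\1/' |
        while IFS= read -r path; do
            [ -x "$path" ] || printf '%s\n' "$path"
        done
}

# Illustrative input: /bin/sh exists, the second path is assumed not to.
printf '%s\n' \
    '  BinaryName = "/bin/sh"' \
    '  runtime = "/etc/docker-runtimes.d/nvidia-does-not-exist"' |
    missing_runtime_paths
```

On the node, something like 【containerd config dump | missing_runtime_paths】 would check the effective config. Sharing the output of 【containerd config dump】 also answers the request for the config.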
Hi,
I ran into this issue, the fix for me was to run...
sudo yum-config-manager --disable amzn2-graphics
If you install the container toolkit following the NVIDIA instructions after doing this you should be in a better position.
Managed to find this here...
@killmepete thanks, that helped me instantly!
I fixed this with:
sudo yum-config-manager --disable amzn2-graphics
sudo yum erase -y libnvidia-container
sudo yum install -y nvidia-container-toolkit
sudo yum-config-manager --enable amzn2-graphics
sudo yum install -y docker-runtime-nvidia
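The five steps above can be wrapped in a small script so they run in order and stop at the first failure. This is only a sketch: the `DRY_RUN` switch and the function names are my own additions, and by default it only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

fix_toolkit_install() {
    run sudo yum-config-manager --disable amzn2-graphics
    run sudo yum erase -y libnvidia-container
    run sudo yum install -y nvidia-container-toolkit
    run sudo yum-config-manager --enable amzn2-graphics
    run sudo yum install -y docker-runtime-nvidia
}

fix_toolkit_install
```

Saved as a file (the name is up to you), running it with `DRY_RUN=0` in the environment executes the commands for real. Keep in mind that the erase step also removes the dependent Amazon packages, which is why docker-runtime-nvidia is installed again at the end.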
Note that the erase step removed the whole set of Amazon-provided NVIDIA container packages:
$ sudo yum erase -y libnvidia-container
Failed to set locale, defaulting to C
Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, upgrade-helper, versionlock
Resolving Dependencies
--> Running transaction check
---> Package libnvidia-container.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: libnvidia-container.so.1()(64bit) for package: libnvidia-container-tools-1.4.0-1.amzn2.x86_64
--> Processing Dependency: libnvidia-container.so.1(NVC_1.0)(64bit) for package: libnvidia-container-tools-1.4.0-1.amzn2.x86_64
--> Running transaction check
---> Package libnvidia-container-tools.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: libnvidia-container-tools for package: nvidia-container-runtime-hook-1.4.0-1.amzn2.x86_64
--> Running transaction check
---> Package nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: nvidia-container-runtime-hook for package: docker-runtime-nvidia-1-2.amzn2.noarch
--> Running transaction check
---> Package docker-runtime-nvidia.noarch 0:1-2.amzn2 will be erased
--> Finished Dependency Resolution
Dependencies Resolved
=======================================================================================================================================================
Package Arch Version Repository Size
=======================================================================================================================================================
Removing:
libnvidia-container x86_64 1.4.0-1.amzn2 @amzn2-nvidia 243 k
Removing for dependencies:
docker-runtime-nvidia noarch 1-2.amzn2 @amzn2-nvidia 435
libnvidia-container-tools x86_64 1.4.0-1.amzn2 @amzn2-nvidia 113 k
nvidia-container-runtime-hook x86_64 1.4.0-1.amzn2 @amzn2-nvidia 1.8 M
Transaction Summary
=======================================================================================================================================================
Remove 1 Package (+3 Dependent packages)
Installed size: 2.2 M
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Erasing : docker-runtime-nvidia-1-2.amzn2.noarch 1/4
Erasing : nvidia-container-runtime-hook-1.4.0-1.amzn2.x86_64 2/4
Erasing : libnvidia-container-tools-1.4.0-1.amzn2.x86_64 3/4
Erasing : libnvidia-container-1.4.0-1.amzn2.x86_64 4/4
Verifying : libnvidia-container-1.4.0-1.amzn2.x86_64 1/4
Verifying : libnvidia-container-tools-1.4.0-1.amzn2.x86_64 2/4
Verifying : nvidia-container-runtime-hook-1.4.0-1.amzn2.x86_64 3/4
Verifying : docker-runtime-nvidia-1-2.amzn2.noarch 4/4
Removed:
libnvidia-container.x86_64 0:1.4.0-1.amzn2
Dependency Removed:
docker-runtime-nvidia.noarch 0:1-2.amzn2 libnvidia-container-tools.x86_64 0:1.4.0-1.amzn2 nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2
Complete!
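To see at a glance what an erase will take with it, the "will be erased" lines of the yum output can be filtered. A minimal sketch, assuming the output format shown in the transcript above (`erased_packages` is a made-up helper name; the sample input is copied from that transcript):

```shell
#!/usr/bin/env bash
# Reads yum output on stdin and prints the name.arch of every package
# that the dependency resolver marked "will be erased".
erased_packages() {
    awk '/^---> Package .* will be erased$/ { print $3 }'
}

erased_packages <<'EOF'
--> Running transaction check
---> Package libnvidia-container.x86_64 0:1.4.0-1.amzn2 will be erased
--> Processing Dependency: libnvidia-container.so.1()(64bit) for package: libnvidia-container-tools-1.4.0-1.amzn2.x86_64
---> Package libnvidia-container-tools.x86_64 0:1.4.0-1.amzn2 will be erased
---> Package nvidia-container-runtime-hook.x86_64 0:1.4.0-1.amzn2 will be erased
---> Package docker-runtime-nvidia.noarch 0:1-2.amzn2 will be erased
--> Finished Dependency Resolution
EOF
```

Running yum erase without -y and answering N at the confirmation prompt gives the same resolver output without removing anything.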
I am using AWS EC2 (Tesla T4). I think the NVIDIA driver was installed by default.
Running nvidia-smi gives the expected output:
Thu Nov 9 07:36:11 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
System Info
cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
But when I install nvidia-container-toolkit, I get errors.
I installed nvidia-container-toolkit following the guide:
step 1 (success): curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
Output:
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
step 2 (failed): sudo yum install -y nvidia-container-toolkit
Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, versionlock
neuron | 2.9 kB 00:00:00
nvidia-container-toolkit/x86_64/signature | 833 B 00:00:00
Retrieving key from https://nvidia.github.io/libnvidia-container/gpgkey
nvidia-container-toolkit/x86_64/signature | 2.1 kB 00:00:02 !!!
Traceback (most recent call last):
File "/bin/yum", line 29, in <module>
yummain.user_main(sys.argv[1:], exit_code=True)
File "/usr/share/yum-cli/yummain.py", line 375, in user_main
errcode = main(args)
File "/usr/share/yum-cli/yummain.py", line 184, in main
result, resultmsgs = base.doCommands()
File "/usr/share/yum-cli/cli.py", line 584, in doCommands
return self.yum_cli_commands[self.basecmd].doCommand(self, self.basecmd, self.extcmds)
File "/usr/share/yum-cli/yumcommands.py", line 446, in doCommand
return base.installPkgs(extcmds, basecmd=basecmd)
File "/usr/share/yum-cli/cli.py", line 1016, in installPkgs
txmbrs = self.install(pattern=arg)
File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 4827, in install
mypkgs = self.pkgSack.returnPackages(patterns=pats,
File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 1074, in <lambda>
pkgSack = property(fget=lambda self: self._getSacks(),
File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 778, in _getSacks
self.repos.populateSack(which=repos)
File "/usr/lib/python2.7/site-packages/yum/repos.py", line 347, in populateSack
self.doSetup()
File "/usr/lib/python2.7/site-packages/yum/repos.py", line 157, in doSetup
self.retrieveAllMD()
File "/usr/lib/python2.7/site-packages/yum/repos.py", line 88, in retrieveAllMD
dl = repo._async and repo._commonLoadRepoXML(repo)
File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1553, in _commonLoadRepoXML
result = self._getFileRepoXML(local, text)
File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1330, in _getFileRepoXML
size=102400) # setting max size as 100K
File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1105, in _getFile
**kwargs
File "/usr/lib/python2.7/site-packages/urlgrabber/mirror.py", line 448, in urlgrab
return self._mirror_try(func, url, kw)
File "/usr/lib/python2.7/site-packages/urlgrabber/mirror.py", line 425, in _mirror_try
return func_ref( *(fullurl,), opts=opts, **kw )
File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1216, in urlgrab
return self._retry(opts, retryfunc, url, filename)
File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1105, in _retry
r = apply(func, (opts,) + args, {})
File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1210, in retryfunc
_run_callback(opts.checkfunc, obj)
File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 1073, in _run_callback
return cb(obj, arg, karg)
File "/usr/lib/python2.7/site-packages/yum/yumRepo.py", line 1802, in _checkRepoXML
self.gpg_import_func(self, self.confirm_func)
File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 6420, in getKeyForRepo
self._getAnyKeyForRepo(repo, repo.gpgdir, repo.gpgkey, is_cakey=False, callback=callback)
File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 6339, in _getAnyKeyForRepo
if hex(int(info['keyid']))[2:-1].upper() in misc.return_keyids_from_pubring(destdir):
File "/usr/lib/python2.7/site-packages/yum/misc.py", line 623, in return_keyids_from_pubring
for k in ctx.keylist():
gpgme.GpgmeError: (7, 32848, u'No such device')
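The numbers in the final GpgmeError can be read a little further. As far as I understand libgpg-error's encoding (treat this as an assumption, not something confirmed in this thread), the tuple is (source, code, message), and a code with bit 15 set is a wrapped operating-system error rather than a GPG-specific one. A minimal sketch of that arithmetic (`describe_gpg_code` is a made-up helper name):

```shell
#!/usr/bin/env bash
# Splits a libgpg-error code into "system error" vs "library error".
# Assumption: GPG_ERR_SYSTEM_ERROR is bit 15 (1 << 15 = 32768).
describe_gpg_code() {
    local code="$1"
    local system_flag=$(( 1 << 15 ))
    if (( code & system_flag )); then
        # The low bits are libgpg-error's own index, not the Linux errno value.
        echo "system error, index $(( code & (system_flag - 1) ))"
    else
        echo "library error, code $code"
    fi
}

describe_gpg_code 32848
```

If that reading is right, the "No such device" failure is raised by the operating system while gpgme lists the keys in the repo's keyring directory (the ctx.keylist() call in the last traceback frame), not by the signature check itself.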