joshuacox opened this issue 7 months ago.
Of note, I have also tried without KinD and instead using k0s with the exact same result.
Could you confirm that you're able to run nvidia-smi in the Kind worker node?
I can confirm that it does not run inside kind:
on the bare metal:
nvidia-smi
Tue Jan 23 17:10:33 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 50C P8 12W / 220W | 260MiB / 8192MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3883 G /usr/lib/xorg/Xorg 146MiB |
| 0 N/A N/A 4041 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 6091 G /usr/bin/nautilus 16MiB |
| 0 N/A N/A 78264 G ...b/firefox-esr/firefox-esr 10MiB |
| 0 N/A N/A 702357 G vlc 6MiB |
+-----------------------------------------------------------------------------+
from a container inside of k0s:
k logs nv-5dc699dbc6-xwhwt
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found
and from inside kind:
k logs nv-5df8456f86-9gkwf
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found
with this as my deployment:
cat nv-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: ./kompose convert -f docker-compose.yml
    kompose.version: 1.22.0 (955b78124)
  labels:
    io.kompose.service: nv
  name: nv
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: nv
  template:
    metadata:
      annotations:
        kompose.cmd: ./kompose convert -f docker-compose.yml
        kompose.version: 1.22.0 (955b78124)
      labels:
        io.kompose.network/noworky-default: "true"
        io.kompose.service: nv
    spec:
      containers:
        - args:
            - nvidia-smi
          image: nvidia/cuda:12.3.1-devel-centos7
          name: nv
      restartPolicy: Always
What are you doing to inject GPU support into the docker container that kind starts to represent the k8s node?
Something like this is necessary: https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
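For reference, the approach in that linked comment boils down to mounting /dev/null at the path the NVIDIA container runtime watches for device requests. A minimal sketch of such a Kind config (assuming accept-nvidia-visible-devices-as-volume-mounts = true is set on the host; the node roles are illustrative):

```yaml
# Sketch of a kind cluster config that requests all GPUs for the worker node.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraMounts:
      # With accept-nvidia-visible-devices-as-volume-mounts = true, this mount
      # tells the nvidia container runtime to inject all GPUs into the node.
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
```

This only has an effect if the nvidia runtime is also the default runtime of the Docker daemon that kind uses to start the node containers.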
Using the example config you supplied I get the same results:
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found
I forgot to include that config file:
/etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
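As an aside on the [nvidia-container-runtime.modes.cdi] section above: the spec-dirs entries are where the runtime discovers CDI device specifications. A rough sketch of what such a spec looks like (along the lines of what nvidia-ctk cdi generate emits; the device and file paths here are illustrative, not taken from this machine):

```yaml
# e.g. /etc/cdi/nvidia.yaml (sketch)
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
    - path: /dev/nvidia-uvm
```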
I even gave that create-cluster.sh script a try:
+++ local 'value=VERSION ?= v0.1.0'
+++ echo v0.1.0
++ DRIVER_IMAGE_VERSION=v0.1.0
++ : k8s-dra-driver
++ : ubuntu20.04
++ : v0.1.0
++ : nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0
++ : v1.27.1
++ : k8s-dra-driver-cluster
++ : /home/thoth/k8s-dra-driver/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : v20230515-01914134-containerd_v1.7.1
++ : gcr.io/k8s-staging-kind/base:v20230515-01914134-containerd_v1.7.1
++ : kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1
+ kind create cluster --retain --name k8s-dra-driver-cluster --image kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 --config /home/thoth/k8s-dra-driver/demo/clusters/kind/scripts/kind-cluster-config.yaml
Creating cluster "k8s-dra-driver-cluster" ...
✓ Ensuring node image (kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1) 🖼
✓ Preparing nodes 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-k8s-dra-driver-cluster"
You can now use your cluster with:
kubectl cluster-info --context kind-k8s-dra-driver-cluster
Thanks for using kind! 😊
+ docker exec -it k8s-dra-driver-cluster-worker umount -R /proc/driver/nvidia
++ docker images --filter reference=nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0 -q
+ EXISTING_IMAGE_ID=
+ '[' '' '!=' '' ']'
+ set +x
Cluster creation complete: k8s-dra-driver-cluster
Same results though.
This appears to be the same issue as https://github.com/NVIDIA/k8s-device-plugin/issues/478
Backing up: what about running with GPUs under Docker in general (i.e. without kind)?
docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi
If things are not configured properly to have that work, then kind will not work either.
To be clear, that will work so long as accept-nvidia-visible-devices-as-volume-mounts = false
Once that is configured to true you would need to run:
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Both seem to work:
docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi
Wed Jan 24 04:09:07 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 12W / 220W | 156MiB / 8192MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Wed Jan 24 04:09:15 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 12W / 220W | 156MiB / 8192MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
grep accept-nvidia-visible-devices-as /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
OK. That’s encouraging.
So you're saying that even with that configured properly, if you run the create-cluster.sh script from the k8s-dra-driver repo, docker exec into the worker node created by kind, and run nvidia-smi, it doesn't work?
Well, at the moment ./create-cluster.sh ends with this error:
+ kind load docker-image --name k8s-dra-driver-cluster nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0
Image: "nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0" with ID "sha256:9c74ea73db6f97a5e7287e11888757504b1e5ecfde4d2e5aa8396a25749ae046" not yet present on node "k8s-dra-driver-cluster-control-plane", loading...
Image: "nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0" with ID "sha256:9c74ea73db6f97a5e7287e11888757504b1e5ecfde4d2e5aa8396a25749ae046" not yet present on node "k8s-dra-driver-cluster-worker", loading...
ERROR: failed to load image: command "docker exec --privileged -i k8s-dra-driver-cluster-control-plane ctr --namespace=k8s.io images import --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: unpacking nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0 (sha256:e9df1b5622ca4f042dcff02f580a0a18ecad4b740fe639df2349c55067ef35b7)...time="2024-01-24T04:21:59Z" level=info msg="apply failure, attempting cleanup" error="wrong diff id calculated on extraction \"sha256:f344b08ff6c5121d786112e0f588c627da349e4289e409d1fde1b3ad8845fa66\"" key="extract-191866144-_8aF sha256:6c3e7df31590f02f10cb71fc4eb27653e9b428df2e6e5421a455b062bd2e39f9"
ctr: wrong diff id calculated on extraction "sha256:f344b08ff6c5121d786112e0f588c627da349e4289e409d1fde1b3ad8845fa66"
and ./install-dra-driver.sh now fails with:
+ kubectl label node k8s-dra-driver-cluster-control-plane --overwrite nvidia.com/dra.controller=true
node/k8s-dra-driver-cluster-control-plane labeled
+ helm upgrade -i --create-namespace --namespace nvidia-dra-driver nvidia /home/thoth/k8s-dra-driver/deployments/helm/k8s-dra-driver --wait
Release "nvidia" does not exist. Installing it now.
Error: client rate limiter Wait returned an error: context deadline exceeded
the build is successful from: ./build-dra-driver.sh
so I'm confused about what is wrong.
I tried doing an equivalent ctr run with:
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi
but it is just hanging here with no output.
I figured out the equivalent ctr command (I was missing the container ID argument, nvidiacontainer, above):
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidiacontainer nvidia-smi
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/bin/nvidia-smi": stat /usr/bin/nvidia-smi: no such file or directory: unknown
in comparison to the docker:
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Wed Jan 24 05:23:32 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 50C P8 12W / 220W | 117MiB / 8192MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I'm uncertain why that file exists in the docker case but not in the ctr one:
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 which nvidia-smi
/usr/bin/nvidia-smi
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi500 which nvidia-smi
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi501 ls /usr/bin/nvidia-smi
ls: cannot access '/usr/bin/nvidia-smi': No such file or directory
probably some magic I'm unaware of.
ctr does not use the nvidia-container-runtime even if you have configured the CRI plugin in the containerd config to use it. The ctr command does not go through CRI, so it would need to be configured separately to use the nvidia runtime (but that wouldn't help with your current problem of getting k8s to work anyway, since k8s does communicate with containerd over CRI).
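To make the distinction concrete: the runtime that k8s ends up using is selected through the CRI plugin section of the containerd config, which is roughly what nvidia-ctk runtime configure --runtime=containerd writes. A sketch of the relevant section (containerd v2 config schema; the binary path is an assumption):

```toml
# /etc/containerd/config.toml (sketch)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

ctr bypasses all of this; for it you would instead have to point at the runtime binary explicitly, e.g. with ctr run --runc-binary /usr/bin/nvidia-container-runtime.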
Since I don't have k0s experience, let's start out assuming that your goal is to install the GPU Operator in a Kind cluster with GPU support. This involves two stages: making the host's GPU driver and devices available in the Kind worker nodes, and then making the NVIDIA Container Toolkit available inside the cluster.
I've tried to provide more details for each of the stages below. In order to get to the bottom of this issue we would need to identify which of these is not working as expected. Once we've run through the steps for kind it may be possible to map the steps to something like k0s.
Note that as a prerequisite, the NVIDIA driver must be installed on the host and nvidia-smi must work there; that already seems to be the case. GPU support in the Kind nodes then needs to be set up as described in https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
This means that we need to do the following:
1. Ensure that the nvidia runtime is configured as the default runtime in the Docker daemon config. (Note that the daemon needs to be restarted to apply this config.)
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
2. Ensure that accept-nvidia-visible-devices-as-volume-mounts = true is set in /etc/nvidia-container-runtime/config.toml
3. Ensure that the mount /dev/null:/var/run/nvidia-container-devices/all is added to the Kind config for the nodes that require GPU access.
In order to verify that the nodes have the GPU devices and driver installed correctly, one can exec into the Kind worker node and run nvidia-smi:
docker exec -ti <node-cluster> nvidia-smi -L
This should give the same output as on the host. I noted in your example that you are starting a single node Kind cluster. This should not affect the behaviour, but is a difference between our cluster definitions and the ones that you use.
At this point, the Kind cluster represents a k8s cluster with only the GPU driver installed. Even though the NVIDIA Container Toolkit is installed on the host, it has not been injected into the nodes.
This means that we should do one of the following:
1. Ensure that --set toolkit.enabled=true (the default) is specified when installing the GPU Operator. (Note that your description mentions that --set toolkit.enabled=false was specified.)
2. Install the container toolkit into the nodes ourselves. For the Kind demo included in this repo, we don't use the GPU Operator, and as such we install the container toolkit when creating the cluster: https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L38-L47
Note that the Kind nodes themselves are effectively Debian nodes and are not officially supported. Most of this might be due to driver container limitations, though, and may not be applicable in this case, since we are dealing with a preinstalled driver.
on the host:
nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.14.4
commit: d167812ce3a55ec04ae2582eff1654ec812f42e1
cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
docker exec -it 3251f /bin/bash
root@k8s-dra-driver-cluster-worker:/# nvidia-smi
Wed Jan 24 15:10:56 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 47C P8 12W / 220W | 169MiB / 8192MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@k8s-dra-driver-cluster-worker:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-b83f1b66-74d7-a38e-932e-ef815cb45105)
However I seem to be stuck on the install inside the worker:
root@k8s-dra-driver-cluster-worker:/# apt-get install -y nvidia-container-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
nvidia-container-toolkit is already the newest version (1.15.0~rc.1-1).
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
nvidia-container-toolkit : Depends: nvidia-container-toolkit-base (= 1.15.0~rc.1-1) but it is not going to be installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
root@k8s-dra-driver-cluster-worker:/# apt-get install -y nvidia-container-toolkit-base
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
nvidia-container-toolkit-base
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
3 not fully installed or removed.
Need to get 2361 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Get:1 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64 nvidia-container-toolkit-base 1.15.0~rc.1-1 [2361 kB]
Fetched 2361 kB in 0s (10.6 MB/s)
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 11315 files and directories currently installed.)
Preparing to unpack .../nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.15.0~rc.1-1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
root@k8s-dra-driver-cluster-worker:/# apt --fix-broken install
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Correcting dependencies... Done
The following additional packages will be installed:
nvidia-container-toolkit-base
The following NEW packages will be installed:
nvidia-container-toolkit-base
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
3 not fully installed or removed.
Need to get 2361 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64 nvidia-container-toolkit-base 1.15.0~rc.1-1 [2361 kB]
Fetched 2361 kB in 0s (11.9 MB/s)
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 11315 files and directories currently installed.)
Preparing to unpack .../nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.15.0~rc.1-1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
Of note, I am using the kind cluster config from this repo, so it is no longer single-node.
For "reasons" we were injecting the /usr/bin/nvidia-ctk binary from the host into the container for the k8s-dra-driver. This is what is causing:
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Remove the lines here in the kind cluster config. (Or unmount /usr/bin/nvidia-ctk before trying to install the toolkit.)
I have an open action item to improve the installation of the toolkit in the DRA driver repo, but have not gotten around to it.
Unmounting /usr/bin/nvidia-ctk fixed the apt issues, and I can install nvidia-container-toolkit just fine. That doesn't solve the problem, though: the nvidia-device-plugin-daemonset still seems unable to see the GPU.
k logs -n kube-system nvidia-device-plugin-daemonset-d82pg
I0125 03:54:44.043725 1 main.go:154] Starting FS watcher.
I0125 03:54:44.043771 1 main.go:161] Starting OS watcher.
I0125 03:54:44.043840 1 main.go:176] Starting Plugins.
I0125 03:54:44.043849 1 main.go:234] Loading configuration.
I0125 03:54:44.043895 1 main.go:242] Updating config with default resource matching patterns.
I0125 03:54:44.043975 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0125 03:54:44.043979 1 main.go:256] Retreiving plugins.
W0125 03:54:44.044136 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0125 03:54:44.044156 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0125 03:54:44.044172 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0125 03:54:44.044174 1 factory.go:115] Incompatible platform detected
E0125 03:54:44.044176 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0125 03:54:44.044178 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0125 03:54:44.044179 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0125 03:54:44.044181 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0125 03:54:44.044185 1 main.go:287] No devices found. Waiting indefinitely.
@joshuacox is containerd in the Kind node configured to use the nvidia runtime? In addition, if you don't set it as the default, you will have to add a runtimeClass and specify it when installing the plugin.
See https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L50-L55 where we do this for the device plugin.
If you're installing the GPU Operator with --set toolkit.enabled=true, this should be taken care of for you.
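The runtimeClass mentioned above would be a small object like the following (the handler name must match the named runtime configured in containerd; nvidia here is an assumption):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# Must match the runtime key under the CRI plugin in the containerd config.
handler: nvidia
```

Pods (or the device plugin chart, via its runtimeClassName value) can then request it with runtimeClassName: nvidia in their spec.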
I am just fine with setting toolkit.enabled=true or any other flags, I just want it to work.
Seems to be getting closer. Do I need to umount another symlink here?
k logs -ngpu-operator nvidia-operator-validator-j6hfp -c driver-validation
time="2024-01-25T10:46:28Z" level=info msg="version: 8072420d"
time="2024-01-25T10:46:28Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Thu Jan 25 10:46:28 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 11W / 220W | 152MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
time="2024-01-25T10:46:28Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-01-25T10:46:28Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.1.0-17-amd64\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
That was from a ./create-cluster.sh (in /k8s-dra-driver/demo/clusters/kind) with this afterwards:
#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster
docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "umount /usr/bin/nvidia-ctk && apt-get update && apt-get install -y gpg && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit && nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && systemctl restart containerd"
helm install \
--wait \
--generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true
This issue is probably due to the symlink creation not working under kind. Please update the environment for the validator in the ClusterPolicy to disable the creation of symlinks as described in the error message.
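In ClusterPolicy terms, the setting from the error message corresponds to a fragment like this (sketch; note that the value must be the string "true"):

```yaml
# Fragment of the ClusterPolicy (nvidia.com/v1) spec
validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"  # env values must be strings, not booleans
```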
Environment for the validator in ClusterPolicy?
I have only a tiny section of the daemonset that references a ClusterPolicy:
k get daemonset -n gpu-operator nvidia-operator-validator -o yaml|grep -C10 -i clusterpolicy
manager: kube-controller-manager
operation: Update
subresource: status
time: "2024-01-25T15:25:42Z"
name: nvidia-operator-validator
namespace: gpu-operator
ownerReferences:
- apiVersion: nvidia.com/v1
blockOwnerDeletion: true
controller: true
kind: ClusterPolicy
name: cluster-policy
uid: 1c2e2c3d-b21e-4767-8dd7-18c1535552de
resourceVersion: "23601"
uid: 30f847a6-654e-4136-b362-f912eb344d4c
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: nvidia-operator-validator
app.kubernetes.io/part-of: gpu-operator
All of this seems way beyond the documentation. @elezar, is this because, as you said, the nodes are "Debian nodes and are not officially supported"? If so, what nodes are supported? On this page:
https://nvidia.github.io/libnvidia-container/stable/deb/
it says:
ubuntu18.04, ubuntu20.04, ubuntu22.04, debian10, debian11
So is this all because my host OS is Debian 12?
It just means that when you install the operator, you additionally pass:
--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION"
--set validator.driver.env[0].value="true"
Error: INSTALLATION FAILED: 1 error occurred:
* ClusterPolicy.nvidia.com "cluster-policy" is invalid: spec.validator.driver.env[0].value: Invalid value: "boolean": spec.validator.driver.env[0].value in body must be of type string: "boolean"
I also tried removing the quotes around true to match my other --set lines, and got the exact same result.
#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster
docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "\
  umount /usr/bin/nvidia-ctk && \
  apt-get update && apt-get install -y gpg && \
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
  curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \
  apt-get update && apt-get install -y nvidia-container-toolkit && \
  nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && \
  nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && \
  systemctl restart containerd"
helm install \
--wait \
--generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
--set validator.driver.env[0].value=true
I am also not seeing a validator section in the values.yaml:
Am I looking in the wrong place?
Use --set-string.
Not all possible values are shown in the top-level values.yaml.
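For what it's worth, the same setting can also be expressed in a custom values file passed to `helm install -f`, which sidesteps the `--set`/`--set-string` type-coercion issue entirely. A sketch (the file name is hypothetical; the `validator.driver.env` structure mirrors the `--set` paths used above):

```yaml
# custom-values.yaml (hypothetical file name)
validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"   # quoted so it renders as a string, not a YAML boolean
```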
omg @klueska that one works!
kgp -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-jkfwl 1/1 Running 0 3m59s
gpu-operator-1706209589-node-feature-discovery-gc-7ccd95f7qcvpg 1/1 Running 0 4m12s
gpu-operator-1706209589-node-feature-discovery-master-7cdfmh5zt 1/1 Running 0 4m12s
gpu-operator-1706209589-node-feature-discovery-worker-wcwsp 1/1 Running 0 4m12s
gpu-operator-1706209589-node-feature-discovery-worker-xdcxd 1/1 Running 0 4m12s
gpu-operator-c4fd7b4b7-rv28r 1/1 Running 0 4m12s
nvidia-container-toolkit-daemonset-n994z 1/1 Running 0 3m59s
nvidia-cuda-validator-76zm5 0/1 Completed 0 3m42s
nvidia-dcgm-exporter-b6cs5 1/1 Running 0 3m59s
nvidia-device-plugin-daemonset-4mbb2 1/1 Running 0 3m59s
nvidia-operator-validator-z26kp 1/1 Running 0 3m59s
and to be clear, for any of you stumbling in from the internet, here are my complete additional steps beyond ./create-cluster.sh:
#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster
docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "\
  umount /usr/bin/nvidia-ctk && \
  apt-get update && apt-get install -y gpg && \
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
  curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \
  apt-get update && apt-get install -y nvidia-container-toolkit && \
  nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && \
  nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && \
  systemctl restart containerd"
helm install \
--wait \
--generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
--set-string validator.driver.env[0].value="true"
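Once the script above has run, the toolkit configuration inside the worker can be spot-checked before relying on the operator. A sketch, assuming the kind worker is named as above; these commands only inspect state:

```shell
# The nvidia runtime should now appear in containerd's config:
docker exec "${KIND_CLUSTER_NAME}-worker" grep -n 'nvidia' /etc/containerd/config.toml

# nvidia-ctk should be able to enumerate CDI device specs:
docker exec "${KIND_CLUSTER_NAME}-worker" nvidia-ctk cdi list
```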
Now then, why did I have to do all this extra work over and above the documentation? Is it just because I'm on Debian 12? (I started on Arch Linux; before opening this issue I decided Debian might be more stable.) If this is the expected behavior I'll gladly make a PR documenting all of this, but somehow I feel that is not the case? I am installing Ubuntu 22.04 (jammy) to a partition to test some more.
You're probably the first to run the operator under kind.
Hmmm, now I am going to have to give this another shot using another method. As I said, I've tried k0s above and will give that a second try now that I have a working sanity check. I am familiar with bootstrapping a cluster using both kubeadm and kubespray; I even scripted it all out with another project, kubash.
Are there any other setups that anyone has tried? What is 'supported'?
I've transferred this issue to the gpu-operator repo (since that's what the issue was really related to). I'll let the operator devs answer your last question.
@joshuacox just for reference. The compatibility with Debian that is an issue here is not that of the NVIDIA Container Toolkit (or even the device plugin), but that of the GPU Operator. For the official support matrix see: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms
Note that it is my understanding that this is largely due to the driver container, but there may be some subtle issues that arise from not having qualified the stack on the target operating system.
For what it's worth, we are starting to look at using kind for basic internal tests, and as we address some of the rough edges here, those fixes should make it into the released versions -- although the question of official platform support is not something that I can speak to at present.
@elezar and @klueska thank you guys for helping so much! And thanks for the transfer to this repo, this is probably where I should've submitted the issue in the first place.
@elezar how can I help facilitate building these internal tests? Looking around this repo, I don't see a demo directory like the one we were dealing with above; is that the sort of thing we might want to build here? I'd certainly be interested in facilitating any part of this process that I can.
Although @shivamerla and @cdesiniotis should also chime in here, I think creating a PR adding a demo folder including a basic README.md that runs through getting the GPU Operator installed on kind -- mirroring what we have for the k8s-dra-driver and the k8s-device-plugin -- would be a good start.
I think creating a PR adding a demo folder including a basic README.md that runs through getting the GPU Operator installed on kind
This is fine by me. @joshuacox contributions are welcome!
@joshuacox there is one minor detail I would like to point out. In your helm install command, you explicitly set driver.enabled=true, which is actually not necessary in this case. The kind node already has access to the driver installation from the host, so the GPU Operator does not need to install the driver. In fact, you won't see a pod named nvidia-driver in your pod list, because the operator detected that the NVIDIA driver was already installed and disabled the containerized driver deployment for you.
To clarify: since driver.enabled=true is the default, the GPU Operator correctly identifies a preinstalled driver and skips the deployment of the driver container. It may be better to leave out the flag, or explicitly set it to false, to avoid confusion.
@elezar @cdesiniotis I have set it to false for now, I have a WIP branch here.
I'm not seeing any nvidia-driver pods, but I definitely have a lot more pods and, more importantly, an allocatable GPU with the release chart. At the moment, if I install the release chart nvidia/gpu-operator, I get something like this:
kubectl get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-mbmdj 1/1 Running 0 2m57s
gpu-operator-657b8ffcc-h4wsh 1/1 Running 0 3m17s
nvidia-container-toolkit-daemonset-gknj7 1/1 Running 0 2m58s
nvidia-cuda-validator-f949w 0/1 Completed 0 2m39s
nvidia-dcgm-exporter-gpc7b 1/1 Running 0 2m58s
nvidia-device-plugin-daemonset-mpm6r 1/1 Running 0 2m58s
nvidia-gpu-operator-node-feature-discovery-gc-64bc8485cd-4w7bw 1/1 Running 0 3m17s
nvidia-gpu-operator-node-feature-discovery-master-7fb4d54954j9c 1/1 Running 0 3m17s
nvidia-gpu-operator-node-feature-discovery-worker-gf9dr 1/1 Running 0 3m17s
nvidia-gpu-operator-node-feature-discovery-worker-wzhq4 1/1 Running 0 3m17s
nvidia-operator-validator-7rqbz 1/1 Running 0 2m58s
yet I only get these pods when I use the local chart:
kubectl get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-55df7d9cdd-m5xbm 1/1 Running 0 3m57s
nvidia-gpu-operator-node-feature-discovery-gc-64bc8485cd-knqvz 1/1 Running 0 3m57s
nvidia-gpu-operator-node-feature-discovery-master-7fb4d549zsx6l 1/1 Running 0 3m57s
nvidia-gpu-operator-node-feature-discovery-worker-6kh69 1/1 Running 0 3m57s
nvidia-gpu-operator-node-feature-discovery-worker-ktnnh 1/1 Running 0 3m57s
with the only difference between the two scripts being:
diff install-operator.sh install-release-operator.sh
35c35
< ${PROJECT_DIR}/deployments/gpu-operator
---
> nvidia/gpu-operator
I am running the full delete-cluster, create-cluster, and install-operator sequence with demo.sh, e.g.
for the local chart:
./demo.sh local
for the release chart:
./demo.sh release
What the difference is eludes me at the moment; I'll do some diffing to investigate. I'll go ahead and prep a PR soon, but it's still a WIP for now.
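As a quick check of the "allocatable GPU" mentioned above, the node's allocatable resources can be inspected directly (a sketch; `nvidia.com/gpu` is the resource name advertised by the device plugin):

```shell
# A working setup should list a non-zero nvidia.com/gpu count under Allocatable:
kubectl describe node | grep -A10 'Allocatable:' | grep -i 'nvidia.com/gpu'
```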
@elezar and @klueska, the only real difference I can see is the gdrcopy section in the local chart; am I missing something else?
diff -r /tmp/gpu-operator-release/gpu-operator /tmp/gpu-operator/gpu-operator
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/clusterpolicy.yaml /tmp/gpu-operator/gpu-operator/templates/clusterpolicy.yaml
9c9
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
< app.kubernetes.io/version: "v23.9.1"
---
> app.kubernetes.io/version: "devel-ubi8"
25c25
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
38c38
< version: "v23.9.1"
---
> version: "devel-ubi8"
268c268,274
< version: "v23.9.1"
---
> version: "devel-ubi8"
> imagePullPolicy: IfNotPresent
> gdrcopy:
> enabled: false
> repository: nvcr.io/nvidia/cloud-native
> image: gdrdrv
> version: "v2.4.1"
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/operator.yaml /tmp/gpu-operator/gpu-operator/templates/operator.yaml
9c9
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
< app.kubernetes.io/version: "v23.9.1"
---
> app.kubernetes.io/version: "devel-ubi8"
25c25
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
27c27
< app.kubernetes.io/version: "v23.9.1"
---
> app.kubernetes.io/version: "devel-ubi8"
39c39
< image: nvcr.io/nvidia/gpu-operator:v23.9.1
---
> image: nvcr.io/nvidia/gpu-operator:devel-ubi8
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/rolebinding.yaml /tmp/gpu-operator/gpu-operator/templates/rolebinding.yaml
9c9
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
< app.kubernetes.io/version: "v23.9.1"
---
> app.kubernetes.io/version: "devel-ubi8"
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/role.yaml /tmp/gpu-operator/gpu-operator/templates/role.yaml
9c9
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
< app.kubernetes.io/version: "v23.9.1"
---
> app.kubernetes.io/version: "devel-ubi8"
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/serviceaccount.yaml /tmp/gpu-operator/gpu-operator/templates/serviceaccount.yaml
9c9
< helm.sh/chart: gpu-operator-v23.9.1
---
> helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
< app.kubernetes.io/version: "v23.9.1"
---
> app.kubernetes.io/version: "devel-ubi8"
I have a draft PR open here
@klueska @elezar @cdesiniotis the PR is open and ready, if only the release chart is considered. I am still having issues with the local chart in the deployments directory; I have added details of the issue in the PR, and I've streamlined the scripts to illustrate the problem.
In short, the release chart works great, e.g.
./demo.sh release
However, the local install falls a bit short with gdrcopy both enabled and disabled, e.g.
./demo.sh gdrcopy
./demo.sh local
I'm still failing to see the real difference in the actual chart, though.
1. Issue or feature description
When following the quickstart I end up with this error in
k describe po -n gpu-operator gpu-feature-discovery-6tk4h
Warning FailedCreatePodSandBox 0s (x5 over 49s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
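That `no runtime for "nvidia" is configured` error generally means that containerd inside the node has no `nvidia` runtime registered (i.e. `nvidia-ctk runtime configure` was never run there, or containerd was not restarted afterwards). One way to confirm, assuming the kind worker name used later in this thread:

```shell
# Expect a [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] section:
docker exec -it k8s-dra-driver-cluster-worker \
  grep -n 'runtimes.nvidia' /etc/containerd/config.toml
```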
2. Steps to reproduce the issue
with my kind-config.yaml
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of nvidia-smi -a on your host, and docker run --rm nvidia/cuda:12.3.1-devel-centos7 nvidia-smi
- Your docker configuration file (e.g. /etc/docker/daemon.json) and /etc/containerd/config.toml
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- docker version
and the helm below fails as well:
uname -a
Linux saruman 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
dmesg
none that I see?
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
the above page no longer exists.
sudo journalctl -u nvidia-container-toolkit
-- No entries --