sfxworks opened this issue 1 year ago
According to https://github.com/NVIDIA/gpu-operator/issues/401#issuecomment-1245932303, this change was applied, but the Helm chart may not be referencing the latest image by default.
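For what it's worth, overriding which operator image the chart deploys can be done at install time. A minimal sketch, assuming the chart exposes `operator.repository`, `operator.image`, and `operator.version` values (verify against your chart version's values.yaml) and that the nvidia helm repo is already added:

```sh
# Override the operator image/tag instead of relying on the chart default (devel-ubi8).
# Value names are assumptions taken from values.yaml; check them for your chart version.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set operator.repository=nvcr.io/nvidia \
  --set operator.image=gpu-operator \
  --set operator.version=v22.9.0
```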
@sfxworks what version of GPU Operator are you using? We migrated to `node.k8s.io/v1` in v22.9.0.
devel-ubi8 according to https://github.com/NVIDIA/gpu-operator/blob/master/deployments/gpu-operator/values.yaml#L50
nvidia-driver-daemonset-ttzrt 0/1 Init:0/1 0 22s 10.0.7.146 home-2cf05d8a44a0 <none> <none>
The tag you linked worked, though now other images are having issues with their defaults:
Normal Pulling 70s (x4 over 2m39s) kubelet Pulling image "nvcr.io/nvidia/driver:525.60.13-"
Warning Failed 68s (x4 over 2m37s) kubelet Failed to pull image "nvcr.io/nvidia/driver:525.60.13-": rpc error: code = Unknown desc = reading manifest 525.60.13- in nvcr.io/nvidia/driver: manifest unknown: manifest unknown
Is there a publicly viewable way to see your registry's tags to resolve this quicker? They just time out.
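In the meantime, one way to list the published tags without the web UI is to query the registry directly, e.g. with skopeo (a sketch; the repository path is the one from the error above, and anonymous access is assumed to work for the public driver image):

```sh
# List the tags published for the driver image on nvcr.io.
skopeo list-tags docker://nvcr.io/nvidia/driver
```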
Changing the version of the driver to latest in the Helm chart adds a trailing -, leading to an invalid image:
image: nvcr.io/nvidia/driver:latest-
containers:
- args:
  - init
  command:
  - nvidia-driver
  image: nvcr.io/nvidia/driver:latest-
  imagePullPolicy: IfNotPresent
  name: nvidia-driver-ctr
  resources: {}
  securityContext:
    privileged: true
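Since the tag is apparently built as `<driver version>-<os suffix>`, pinning the driver to a concrete version instead of `latest` at least gives the operator a valid base to append to. A sketch, assuming the chart's `driver.repository`, `driver.image`, and `driver.version` values (names taken from values.yaml, not verified for every chart version):

```sh
# Pin the driver to an explicit version; the operator still appends the OS suffix
# (e.g. -ubuntu20.04) based on NFD labels. 525.60.13 is just the version from above.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.repository=nvcr.io/nvidia \
  --set driver.image=driver \
  --set driver.version=525.60.13
```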
It doesn't like my kernel anyway I guess :/
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 6.0.11-hardened1-1-hardened
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Switching the machine over from linux-hardened to the regular linux kernel, with the above adjustments, seems to have been successful. Between then and now I did not have to adjust the daemonset either.
nvidia.com/gpu.compute.major: "7"
nvidia.com/gpu.compute.minor: "5"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.deploy.container-toolkit: "true"
nvidia.com/gpu.deploy.dcgm: "true"
nvidia.com/gpu.deploy.dcgm-exporter: "true"
Resource            Requests       Limits
--------            --------       ------
cpu                 3300m (30%)    3500m (31%)
memory              12488Mi (19%)  12638Mi (19%)
ephemeral-storage   0 (0%)         0 (0%)
hugepages-1Gi       0 (0%)         0 (0%)
hugepages-2Mi       0 (0%)         0 (0%)
nvidia.com/gpu      0              0
@sfxworks for installing the latest helm charts, please refer to: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator.
We append a `-<os>` suffix (e.g. `-ubuntu20.04`) to match the OS of your worker nodes. We depend on labels from NFD (`feature.node.kubernetes.io/system-os_release.ID` and `feature.node.kubernetes.io/system-os_release.VERSION_ID`) to get this information. If only `-` was appended, it's possible these labels were missing.
Concerning the kernel version, the driver container requires several kernel packages (e.g. kernel-devel). From your logs, it appears it could not find these packages for `6.0.11-hardened1-1-hardened`. A workaround is to pass a custom repository file to the driver pod so it can properly find packages for the particular kernel. The following page has some details on how to do this: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/appendix.html#local-package-repository
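For reference, the workaround on that page amounts to mounting a custom repository file into the driver container through a ConfigMap. A minimal sketch, assuming the chart exposes `driver.repoConfig.configMapName` (check the linked page and your chart's values.yaml; the repo file name here is a placeholder):

```sh
# Create a ConfigMap from a locally prepared repo file and point the driver pod at it.
kubectl create configmap repo-config -n gpu-operator --from-file=./custom-repo.repo

# Reference it in the chart values (value name assumed from values.yaml).
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.repoConfig.configMapName=repo-config
```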
I have `feature.node.kubernetes.io/system-os_release.ID: arch`, though I do not have `feature.node.kubernetes.io/system-os_release.VERSION_ID` on any nodes (some Manjaro-based, some Arch-based). I cannot remember how I had this working before...
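A quick way to see exactly which OS-release labels NFD has applied to each node (same kind of check used later in this thread), and what it has to work with on the node itself:

```sh
# VERSION_ID will be missing from the labels if /etc/os-release on the node has no
# VERSION_ID field, which is typical for rolling-release distros like Arch.
kubectl describe node <node-name> | grep system-os_release

# On the node itself:
cat /etc/os-release
```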
I just installed GPU Operator with Helm, version `v23.3.1`. This version uses the `nvcr.io/nvidia/gpu-operator:devel-ubi8` image, which has exactly this error:
1.6704367755844975e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
When I change GPU Operator to version `v22.9.2`, it uses the `nvcr.io/nvidia/gpu-operator:v22.9.0` image and the error disappears. Can you please check it again @cdesiniotis?
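To confirm what the cluster actually serves versus what the operator is requesting, something like this helps (on Kubernetes 1.25+ only `node.k8s.io/v1` is served; `v1beta1` was removed):

```sh
kubectl api-versions | grep node.k8s.io
kubectl api-resources | grep -i runtimeclass
```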
I'm also running into this issue ... on both `release-23.03` and `master` branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.
Same error for us with EKS 1.27 and Ubuntu 22
Same error with release 23.3.1. Any solution...?
Also running into this on Amazon Linux 2. Any known solution or workaround, or something missing in the docs? Trying to override the API version or looking at the daemonset values next.
release v24.6.1 - nvcr.io/nvidia/gpu-operator:devel-ubi8
ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
kubectl describe node GPU-NODE | grep system
  feature.node.kubernetes.io/system-os_release.ID=amzn
  feature.node.kubernetes.io/system-os_release.VERSION_ID=2
  feature.node.kubernetes.io/system-os_release.VERSION_ID.major=2
On the node: yum list installed | grep kernel
  kernel.x86_64           5.10.220-209.869.amzn2   @amzn2extra-kernel-5.10
  kernel-devel.x86_64     5.10.220-209.869.amzn2   @amzn2extra-kernel-5.10
  kernel-headers.x86_64   5.10.220-209.869.amzn2   @amzn2extra-kernel-5.10
@robjcook it looks like you have deployed the local helm chart that is checked into the gpu-operator main branch. We don't recommend using that helm chart.
Please use the helm chart from the official helm repo as instructed here.
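For reference, installing from the official repo per that page looks roughly like this (flags as shown in the getting-started docs; adjust namespace and version as needed):

```sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```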
A few node toleration issues were worked past, and I switched to the helm chart from the official helm repo.
Running into an issue now where the operator seems to be looking for an image that does not exist and fails to pull:
ImagePullBackOff (Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-amzn2")
Which image do you recommend for an Amazon Linux 2 node, and where do I specify it instead of letting the operator dynamically interpret it from the node?
Edit: after digging through the documentation, looking into this now:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html
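Per that page, precompiled drivers are opt-in and are only published for certain distributions, so it is worth listing the available tags before pinning anything. A sketch, with the flag names as documented there (not verified against Amazon Linux 2 specifically):

```sh
# Check which driver tags actually exist before pinning.
skopeo list-tags docker://nvcr.io/nvidia/driver

# Opt in to precompiled driver images, per the linked precompiled-drivers page.
# --set-string keeps the major version from being parsed as an integer.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.usePrecompiled=true \
  --set-string driver.version=535
```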
1. Quick Debug Checklist

- Are i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description
Deployed with Helm, the operator attempts to reference a deprecated API object (RuntimeClass in node.k8s.io/v1beta1), which prevents deployment.
As noted in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#runtimeclass-v125, RuntimeClass is now served only from node.k8s.io/v1.
The operator cannot reconcile, and deployment of a pod requesting a GPU fails as a result.
2. Steps to reproduce the issue