NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation #458

Open sfxworks opened 1 year ago

sfxworks commented 1 year ago

1. Issue or feature description

When deployed with Helm, the operator references a deprecated API object, which prevents deployment.

As noted in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#runtimeclass-v125, the node.k8s.io group is served only as v1 in 1.25 (the v1beta1 RuntimeClass API has been removed):

kubectl get node home-2cf05d8a44a0 -o yaml | head -2                                                                                                                                                                                                                                             
apiVersion: v1
kind: Node
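
A quick way to confirm what the cluster actually serves for the node.k8s.io group (assuming kubectl access to the affected cluster):

    kubectl api-versions | grep node.k8s.io
    # on 1.25+ this is expected to print only node.k8s.io/v1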

The operator cannot reconcile, and deployment of a pod requesting a GPU fails as a result:

1.6704367753266153e+09  INFO    controllers.ClusterPolicy       Checking GPU state labels on the node   {"NodeName": "home-2cf05d8a44a0"}
1.6704367753266478e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.node-status-exporter", " value=": "true"}
1.670436775326656e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.operator-validator", " value=": "true"}
1.6704367753266625e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.driver", " value=": "true"}
1.6704367753266687e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.gpu-feature-discovery", " value=": "true"}
1.6704367753266747e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.container-toolkit", " value=": "true"}
1.670436775326681e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.device-plugin", " value=": "true"}
1.6704367753266864e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm", " value=": "true"}
1.6704367753266923e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm-exporter", " value=": "true"}
1.67043677532671e+09    INFO    controllers.ClusterPolicy       Number of nodes with GPU label  {"NodeCount": 1}
1.6704367753267498e+09  INFO    controllers.ClusterPolicy       Using container runtime: crio
1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

2. Steps to reproduce the issue

  1. Run Kubernetes 1.25
  2. Deploy the helm operator
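
For reference, step 2 here means a plain Helm deployment of the operator, roughly as below (release name and namespace are assumptions, not taken from the report):

    helm install --wait gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace
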
sfxworks commented 1 year ago

According to https://github.com/NVIDIA/gpu-operator/issues/401#issuecomment-1245932303, this change was applied, but the helm chart may not reference the latest image by default.
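
If the default tag is the problem, the operator image tag can be overridden at install time; a sketch, assuming the chart exposes the tag as operator.version the way the linked values.yaml does:

    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --set operator.version=v22.9.0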

cdesiniotis commented 1 year ago

@sfxworks what version of GPU Operator are you using? We migrated to node.k8s.io/v1 in v22.9.0

sfxworks commented 1 year ago

devel-ubi8 according to https://github.com/NVIDIA/gpu-operator/blob/master/deployments/gpu-operator/values.yaml#L50

sfxworks commented 1 year ago

nvidia-driver-daemonset-ttzrt 0/1 Init:0/1 0 22s 10.0.7.146 home-2cf05d8a44a0 <none> <none>

The tag you linked worked.

Though now other images are having issues with their default tags:

  Normal   Pulling    70s (x4 over 2m39s)  kubelet            Pulling image "nvcr.io/nvidia/driver:525.60.13-"
  Warning  Failed     68s (x4 over 2m37s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:525.60.13-": rpc error: code = Unknown desc = reading manifest 525.60.13- in nvcr.io/nvidia/driver: manifest unknown: manifest unknown

Is there a publicly viewable way to see your registry's tags to resolve this quicker? They just time out.
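
One way to list the tags of a public nvcr.io repository without the NGC UI is skopeo (assuming it is installed locally and the repository allows anonymous listing):

    skopeo list-tags docker://nvcr.io/nvidia/driver | grep 525.60.13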

sfxworks commented 1 year ago

Changing the driver version to latest in the helm chart also appends a trailing -, leading to an invalid image reference (image: nvcr.io/nvidia/driver:latest-):

      containers:
      - args:
        - init
        command:
        - nvidia-driver
        image: nvcr.io/nvidia/driver:latest-
        imagePullPolicy: IfNotPresent
        name: nvidia-driver-ctr
        resources: {}
        securityContext:
          privileged: true
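
The trailing - suggests the OS suffix the operator appends to the driver tag came out empty; one way to inspect the NFD labels that suffix is derived from (label names per cdesiniotis's explanation further down):

    kubectl get node home-2cf05d8a44a0 --show-labels | tr ',' '\n' | grep system-os_release
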
sfxworks commented 1 year ago

It doesn't like my kernel anyway I guess :/

Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 6.0.11-hardened1-1-hardened

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

sfxworks commented 1 year ago

Switching the machine over to the linux kernel instead of linux-hardened, with the above adjustments, seems to have worked. Between then and now I did not have to adjust the daemonset either:

    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3300m (30%)    3500m (31%)
  memory             12488Mi (19%)  12638Mi (19%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     0              0

cdesiniotis commented 1 year ago

@sfxworks for installing the latest helm charts, please refer to: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator.

We append a -<os> suffix (e.g. -ubuntu20.04) to match the OS of your worker nodes. We depend on labels from NFD (feature.node.kubernetes.io/system-os_release.ID and feature.node.kubernetes.io/system-os_release.VERSION_ID) to get this information. If only - was appended, it's possible these labels were missing.

Concerning the kernel version: the driver container requires several kernel packages (e.g. kernel-devel). From your logs, it appears it could not find these packages for 6.0.11-hardened1-1-hardened. A workaround is to pass a custom repository file to the driver pod so it can find packages for that particular kernel. The following page has some details on how to do this: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/appendix.html#local-package-repository
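
As a sketch of that workaround, the repository file can be handed to the driver pod through a ConfigMap; the file and ConfigMap names below are placeholders, and the chart key is assumed to be driver.repoConfig.configMapName as in the linked appendix:

    # custom-repo.repo prepared for the node's distribution and kernel
    kubectl create configmap repo-config -n gpu-operator --from-file=custom-repo.repo
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --set driver.repoConfig.configMapName=repo-config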

sfxworks commented 1 year ago

I have feature.node.kubernetes.io/system-os_release.ID: arch, though I do not have feature.node.kubernetes.io/system-os_release.VERSION_ID on any nodes (some Manjaro-based, some Arch-based). I cannot remember how I had this working before...
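
That is consistent with a rolling-release distribution: Arch's /etc/os-release typically carries ID but no VERSION_ID, so NFD has nothing to publish for that label. A quick check on the node itself:

    grep -E '^(ID|VERSION_ID)=' /etc/os-release
    # on Arch this typically prints only ID=arch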

DatCanCode commented 1 year ago

I just installed GPU Operator with helm, chart version v23.3.1. This version uses the nvcr.io/nvidia/gpu-operator:devel-ubi8 image, which hits exactly this error:

1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

When I change GPU Operator to version v22.9.2, it uses the nvcr.io/nvidia/gpu-operator:v22.9.0 image and the error disappears. Can you please check this again @cdesiniotis?
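
A quick way to confirm which operator image a given install is actually running (deployment name and namespace here are assumptions based on a default install):

    kubectl get deploy -n gpu-operator gpu-operator \
      -o jsonpath='{.spec.template.spec.containers[0].image}'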

berlincount commented 1 year ago

I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

acesir commented 1 year ago

> I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

Same error for us with EKS 1.27 and Ubuntu 22

shnigam2 commented 1 year ago

Same error with release 23.3.1. Any solution...?

robjcook commented 3 months ago

Also running into this on Amazon Linux 2. Any known solution or workaround, or is something missing in the docs? Trying to override the API version or look at the daemonset values next.

release v24.6.1 - nvcr.io/nvidia/gpu-operator:devel-ubi8

ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

kubectl describe node GPU-NODE | grep system
  feature.node.kubernetes.io/system-os_release.ID=amzn
  feature.node.kubernetes.io/system-os_release.VERSION_ID=2
  feature.node.kubernetes.io/system-os_release.VERSION_ID.major=2

On the node:

yum list installed | grep kernel
  kernel.x86_64           5.10.220-209.869.amzn2  @amzn2extra-kernel-5.10
  kernel-devel.x86_64     5.10.220-209.869.amzn2  @amzn2extra-kernel-5.10
  kernel-headers.x86_64   5.10.220-209.869.amzn2  @amzn2extra-kernel-5.10

tariq1890 commented 3 months ago

@robjcook it looks like you have deployed the local helm chart that is checked into the gpu-operator main branch. We don't recommend using that helm chart.

Please use the helm chart from the official helm repo as instructed here
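
For reference, installing from the official repo rather than the in-tree chart looks roughly like this (release name and namespace assumed):

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace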

robjcook commented 1 month ago

Worked past a few node toleration issues and switched to the helm chart in the official helm repo.

Running into an issue now where the operator looks for an image that does not exist and fails to pull:

ImagePullBackOff (Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-amzn2")

Which image do you recommend for an Amazon Linux 2 node, and where do I specify it instead of letting the operator derive it dynamically from the node?

edit: after digging through the documentation, looking into this now:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html
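
The precompiled-driver route that page describes comes down to two chart values; a sketch (chart keys as documented there; the driver branch shown is an assumption, and whether a precompiled image exists for Amazon Linux 2 and this kernel still needs to be confirmed against the registry):

    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --set driver.usePrecompiled=true \
      --set driver.version="550"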