NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Redhat Openshift alerts GPUOperatorOpenshiftDriverToolkitEnabledNfdTooOld #319

Open tnakajo opened 2 years ago

tnakajo commented 2 years ago

1. Issue description

Red Hat OpenShift (ROKS on IBM Cloud) raises the alert GPUOperatorOpenshiftDriverToolkitEnabledNfdTooOld with the description "The DriverToolkit is enabled in the GPU Operator ClusterPolicy, but the NFD version deployed in the cluster is too old to support it." after configuring the NVIDIA GPU Operator per the instructions at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/contents.html (and additionally https://github.ibm.com/aivision/notebook/blob/master/gpu-operator/README.md#important-for-rhel-worker-node-or-ibm-cloud-users).

2. Steps to reproduce the issue

Followed the instruction to install the NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/contents.html

  1. Installing the Node Feature Discovery (NFD) Operator.
  2. Installing the NVIDIA GPU Operator.
  3. The OCP console reports the alert GPUOperatorOpenshiftDriverToolkitEnabledNfdTooOld with the description
    "The DriverToolkit is enabled in the GPU Operator ClusterPolicy, but the NFD version deployed in the cluster is too old to support it." (A quick check that both operators are installed and running is sketched after this list.)
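
A minimal way to confirm both installs are present, assuming the default namespaces from the install docs (openshift-nfd and nvidia-gpu-operator; adjust if you installed elsewhere):

    # NFD and GPU Operator pods should all be Running
    oc get pods -n openshift-nfd
    oc get pods -n nvidia-gpu-operator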

3. Information

oc logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

tnakajo commented 2 years ago

(Screenshots of the OCP console showing the alert, taken 2022-02-02.)

shivamerla commented 2 years ago

@tnakajo With v1.9 we have added a feature to avoid the dependency on cluster-wide entitlements to install the NVIDIA driver. On 4.9 and certain z-stream versions of 4.8, NFD adds a special label to enable this feature. This alert indicates that the feature is not available with your installed 4.8.x version. It is noted here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/steps-overview.html#entitlement-free-supported-versions. This doesn't affect any functionality; it just means cluster-wide entitlements have to be active while the driver is being installed.
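
A quick way to see which versions are in play (the openshift-nfd namespace is an assumption; use whichever namespace NFD was installed into):

    # Cluster version: entitlement-free installs need 4.9 or specific 4.8 z-streams
    oc get clusterversion

    # NFD operator version installed through OLM
    oc get csv -n openshift-nfd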

tnakajo commented 2 years ago

@shivamerla Thank you for the very quick response and the clarification.

donovat commented 2 years ago

@shivamerla - I am seeing the same error message in my Red Hat cluster, also on IBM's cloud and with what looks to be the same levels of code. I also don't fully understand the link included above; as far as I can tell from the tests, we have cluster-wide entitlements active. I also see a number of warnings in the logs for the nvidia-driver-daemonset-xxx pods, including the message /usr/local/bin/nvidia-driver: line 98: OPENSHIFT_VERSION: unbound variable, although it exits with a return code of 0. But both pods belonging to the daemonset are in CrashLoopBackOff.

shivamerla commented 2 years ago

@donovat If you have cluster-wide entitlements applied, then you can ignore that warning message. What it means is that with updated OCP versions (4.9+) entitlements can be avoided.

Regarding the driver error: the GPU Operator will set this env var. Please confirm the operator version you are using, and I will double-check whether it was missing in older versions.


    // Add env vars needed by nvidia-driver to enable the right releasever and EUS rpm repos
    rhelVersion := corev1.EnvVar{Name: "RHEL_VERSION", Value: release["RHEL_VERSION"]}
    ocpVersion := corev1.EnvVar{Name: "OPENSHIFT_VERSION", Value: release["OPENSHIFT_VERSION"]}

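Assuming the DaemonSet is named nvidia-driver-daemonset in the nvidia-gpu-operator namespace (as in the describe output later in this thread), one way to confirm the operator version and whether these env vars were rendered into the driver container:

    # GPU Operator version as reported by OLM
    oc get csv -n nvidia-gpu-operator

    # Env vars the operator set on the driver container of the DaemonSet
    oc get daemonset nvidia-driver-daemonset -n nvidia-gpu-operator \
      -o jsonpath='{.spec.template.spec.containers[0].env}'
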
donovat commented 2 years ago

Hi @shivamerla - I have installed NVIDIA GPU Operator 1.9.1 provided by NVIDIA Corporation on OpenShift version 4.8.26

Would I expect to see these env vars reported in the pod logs? I could not see them in any logs so far.

shivamerla commented 2 years ago

Can you describe the driver pod and see if those env vars are added by the operator?
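
For example, something like this should show just the Environment block of the running driver pod (the app=nvidia-driver-daemonset label is the one visible in the describe output below):

    oc describe pod -n nvidia-gpu-operator -l app=nvidia-driver-daemonset | grep -A 10 'Environment:'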

donovat commented 2 years ago

Hi @shivamerla, this is the info I can see from the describe:

    $ oc describe pod/nvidia-driver-daemonset-67fsd
    Name:                 nvidia-driver-daemonset-67fsd
    Namespace:            nvidia-gpu-operator
    Priority:             2000001000
    Priority Class Name:  system-node-critical
    Node:                 10.185.173.143/10.185.173.143
    Start Time:           Wed, 23 Feb 2022 13:22:02 +0000
    Labels:               app=nvidia-driver-daemonset
                          controller-revision-hash=5557bbb9bb
                          pod-template-generation=1
    Annotations:          cni.projectcalico.org/containerID: 8d499c370d9b92e504486d4ae46cc79f2f728e4b50494aa521c9d0924f8de688
                          cni.projectcalico.org/podIP: 172.30.234.212/32
                          cni.projectcalico.org/podIPs: 172.30.234.212/32
                          k8s.v1.cni.cncf.io/network-status: [{ "name": "k8s-pod-network", "ips": [ "172.30.234.212" ], "default": true, "dns": {} }]
                          k8s.v1.cni.cncf.io/networks-status: [{ "name": "k8s-pod-network", "ips": [ "172.30.234.212" ], "default": true, "dns": {} }]
                          openshift.io/scc: nvidia-driver
    Status:               Running
    IP:                   172.30.234.212
    IPs:
      IP:  172.30.234.212
    Controlled By:  DaemonSet/nvidia-driver-daemonset
    Init Containers:
      k8s-driver-manager:
        Container ID:   cri-o://b8a480479e8e986629ffdde9d6488e833451caf21754ee5a3345a0ce5ea5f987
        Image:          nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:54233ebccbc3d2b388b237031907d58c3719d0e6f3ecb874349c91e8145225d2
        Image ID:       nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:54233ebccbc3d2b388b237031907d58c3719d0e6f3ecb874349c91e8145225d2
        Port:
        Host Port:
        Command:        driver-manager
        Args:           uninstall_driver
        State:          Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Wed, 23 Feb 2022 13:22:06 +0000
          Finished:     Wed, 23 Feb 2022 13:22:50 +0000
        Ready:          True
        Restart Count:  0
        Environment:
          NODE_NAME:                   (v1:spec.nodeName)
          NVIDIA_VISIBLE_DEVICES:      void
          ENABLE_AUTO_DRAIN:           true
          DRAIN_USE_FORCE:             false
          DRAIN_POD_SELECTOR_LABEL:
          DRAIN_TIMEOUT_SECONDS:       0s
          DRAIN_DELETE_EMPTYDIR_DATA:  false
          OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
        Mounts:
          /run/nvidia from run-nvidia (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-429kh (ro)
    Containers:
      nvidia-driver-ctr:
        Container ID:   cri-o://890eccbf09161a8c41ba0ffa70c34729b968a8f88dd826087930fa8749abb98f
        Image:          nvcr.io/nvidia/driver:450.80.02-rhel7.9
        Image ID:       nvcr.io/nvidia/driver@sha256:a91ede2efc8d0d94bc6fe71fddec41b19269afd56ed277ec6de69c3fa1e2ebc9
        Port:
        Host Port:
        Command:        nvidia-driver
        Args:           init
        State:          Running
          Started:      Wed, 23 Feb 2022 13:22:51 +0000
        Ready:          True
        Restart Count:  0
        Environment:
        Mounts:
          /dev/log from dev-log (rw)
          /host-etc/os-release from host-os-release (ro)
          /run/mellanox/drivers from run-mellanox-drivers (rw)
          /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
          /run/nvidia from run-nvidia (rw)
          /run/nvidia-topologyd from run-nvidia-topologyd (rw)
          /var/log from var-log (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-429kh (ro)
    Conditions:
      Type             Status
      Initialized      True
      Ready            True
      ContainersReady  True
      PodScheduled     True
    Volumes:
      run-nvidia:
        Type:          HostPath (bare host directory volume)
        Path:          /run/nvidia
        HostPathType:  DirectoryOrCreate
      var-log:
        Type:          HostPath (bare host directory volume)
        Path:          /var/log
        HostPathType:
      dev-log:
        Type:          HostPath (bare host directory volume)
        Path:          /dev/log
        HostPathType:
      host-os-release:
        Type:          HostPath (bare host directory volume)
        Path:          /etc/os-release
        HostPathType:
      run-nvidia-topologyd:
        Type:          HostPath (bare host directory volume)
        Path:          /run/nvidia-topologyd
        HostPathType:  DirectoryOrCreate
      mlnx-ofed-usr-src:
        Type:          HostPath (bare host directory volume)
        Path:          /run/mellanox/drivers/usr/src
        HostPathType:  DirectoryOrCreate
      run-mellanox-drivers:
        Type:          HostPath (bare host directory volume)
        Path:          /run/mellanox/drivers
        HostPathType:  DirectoryOrCreate
      run-nvidia-validations:
        Type:          HostPath (bare host directory volume)
        Path:          /run/nvidia/validations
        HostPathType:  DirectoryOrCreate
      kube-api-access-429kh:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:
        DownwardAPI:             true
        ConfigMapName:           openshift-service-ca.crt
        ConfigMapOptional:
    QoS Class:       BestEffort
    Node-Selectors:  nvidia.com/gpu.deploy.driver=true
    Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                     node.kubernetes.io/not-ready:NoExecute op=Exists
                     node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                     node.kubernetes.io/unreachable:NoExecute op=Exists
                     node.kubernetes.io/unschedulable:NoSchedule op=Exists
                     nvidia.com/gpu:NoSc

shivamerla commented 2 years ago

@donovat Looks like I found the issue: we use the host's /etc/os-release to fetch the RHEL_VERSION and OPENSHIFT_VERSION fields, but on RHEL nodes OPENSHIFT_VERSION is not present in that file, only on CoreOS nodes. We will need to update this logic. Meanwhile you can edit the driver DaemonSet to pass the env var OPENSHIFT_VERSION=4.7 as a workaround.
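
A minimal way to apply that workaround without editing the manifest by hand, using the DaemonSet name and namespace seen above and the value suggested here:

    oc set env daemonset/nvidia-driver-daemonset -n nvidia-gpu-operator OPENSHIFT_VERSION=4.7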

kpouget commented 2 years ago

Meanwhile you can edit the driver DaemonSet to pass the env var OPENSHIFT_VERSION=4.7 as a workaround.

@shivamerla editing the DaemonSet won't work, it will be reverted by the operator reconciliation, right?

The operator will overwrite it only if the driver spec in the ClusterPolicy is changed. We add an annotation, nvidia.com/last-applied-hash=<hash>, when we create the DaemonSet initially. The hash changes only when the desired spec in the ClusterPolicy changes.
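
One way to inspect that annotation and see what the operator currently considers the applied spec hash (names as above; the backslash escapes the dots in the annotation key for jsonpath):

    oc get daemonset nvidia-driver-daemonset -n nvidia-gpu-operator \
      -o jsonpath='{.metadata.annotations.nvidia\.com/last-applied-hash}'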