NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Fatal Error: Openshift 4.16.10 not compatible with Nvidia-GPU-Operator-24.6.1 #990

Closed jayteaftw closed 1 month ago

jayteaftw commented 1 month ago

Upgraded from OpenShift 4.16.2 to OpenShift 4.16.10. After upgrading, the nvidia-gpu-operator fails to start. Specifically, the nvidia-driver-daemonset-416.94.202407030122 pod is in CrashLoopBackOff. It looks like a kernel problem. See the attached log: nvidia-driver-daemonset-416.94.202407030122-0-cfh5w-nvidia-driver-ctr.log
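
For reference, a minimal sketch of the commands used to find the failing pod and pull its logs (the namespace is assumed to be nvidia-gpu-operator; the pod name is from this cluster and will differ elsewhere):

# Find the crashing driver daemonset pod
oc get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset

# Inspect the driver container logs of the crashing pod
oc logs -n nvidia-gpu-operator \
  nvidia-driver-daemonset-416.94.202407030122-0-cfh5w -c nvidia-driver-ctr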

kenneth-dsouza commented 1 month ago

There is a mismatch between the kernel found on the node and the one present in the image, so the build falls back to an entitled build. Since my user does not have entitled builds enabled, it is bound to fail.

$ omc logs nvidia-driver-daemonset-416.94.202407030122-0-qzlbs -c openshift-driver-toolkit-ctr
2024-09-16T18:30:13.295548997Z + '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
2024-09-16T18:30:13.295548997Z + echo Waiting for nvidia-driver-ctr container to prepare the shared directory ...
2024-09-16T18:30:13.295674042Z Waiting for nvidia-driver-ctr container to prepare the shared directory ...
2024-09-16T18:30:13.295687355Z + sleep 10
2024-09-16T18:30:23.297494455Z + '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
2024-09-16T18:30:23.297531824Z + exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
2024-09-16T18:30:23.300591588Z Running dtk-build-driver
2024-09-16T18:30:23.307484211Z WARNING: broken Driver Toolkit image detected:
2024-09-16T18:30:23.309150415Z - Node kernel:    5.14.0-427.33.1.el9_4.x86_64
2024-09-16T18:30:23.410705794Z - Kernel package: 5.14.0-427.24.1.el9_4.x86_64
2024-09-16T18:30:23.410705794Z INFO: informing nvidia-driver-ctr to fallback on entitled-build.
2024-09-16T18:30:23.412540096Z INFO: nothing else to do in openshift-driver-toolkit-ctr container, sleeping forever.
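
To confirm the mismatch independently of the fallback log, one can compare the node's running kernel with the kernel packaged in the DTK image the daemonset is using (a sketch; the node name is hypothetical):

# Kernel actually running on the GPU node (node name is hypothetical)
oc get node worker-gpu-0 -o jsonpath='{.status.nodeInfo.kernelVersion}{"\n"}'

# Kernel shipped inside the DTK image currently referenced by the daemonset
podman run --rm \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02 \
  rpm -q kernel-core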

The dtk-build-driver on my user's cluster is currently using this image:

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02

But for OCP 4.16.10 it should be the image below, which contains the correct 5.14.0-427.33.1.el9_4.x86_64 kernel:

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
$ omc get daemonset/nvidia-driver-daemonset-416.94.202407030122-0  -o yaml | less
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02 # <---- the daemon set is referring to the old image
        imagePullPolicy: IfNotPresent
        name: openshift-driver-toolkit-ctr

The daemon set is referring to the old image, even though the latest image is present on the cluster:

$ oc adm release info --image-for=driver-toolkit
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
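
A more targeted way to compare the two, as a sketch assuming the daemonset and container names shown above and the default nvidia-gpu-operator namespace (omc reads from a must-gather, oc from a live cluster):

# Image referenced by the openshift-driver-toolkit-ctr container in the daemonset
oc -n nvidia-gpu-operator get daemonset nvidia-driver-daemonset-416.94.202407030122-0 \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="openshift-driver-toolkit-ctr")].image}{"\n"}'

# DTK image shipped with the release the cluster is running
oc adm release info --image-for=driver-toolkit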

Q>> How do I tell the daemon set to use the new image?

tariq1890 commented 1 month ago

@kenneth-dsouza Thanks for sharing your findings.

Can you share the logs of the gpu-operator pod (the main controller pod) and the node labels?

We are specifically interested in the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION label
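
One way to check that label across all nodes, as a sketch using the label-column flag:

# Show the OSTREE_VERSION label next to each node
oc get nodes -L feature.node.kubernetes.io/system-os_release.OSTREE_VERSION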

kenneth-dsouza commented 1 month ago

Hello,

The node has the following labels:


feature.node.kubernetes.io/system-os_release.RHEL_VERSION=9.4,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=416.94.202407030122-0,nvidia.com/gpu.deploy.dcgm-exporter=true

From the operator log I can see it is using the old driver-toolkit image.


./gpu-operator/gpu-operator/logs/current.log:2024-09-18T21:19:32.986239997Z {"level":"info","ts":"2024-09-18T21:19:32Z","logger":"controllers.ClusterPolicy","msg":"DriverToolkit","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02"}

Controller pod logs: (I assume you meant the gpu-operator log.)

I am unable to upload the logs due to security reasons. Is there anything specific you want me to check?

tariq1890 commented 1 month ago

Thanks @kenneth-dsouza !

As per the oc adm output below, the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=416.94.202407030122-0 label points to OCP version 4.16.2

oc adm release info 4.16.2 --pullspecs
Name:           4.16.2
Digest:         sha256:198ae5a1e59183511fbdcfeaf4d5c83a16716ed7734ac6cbeea4c47a32bffad6
Created:        2024-07-04T08:05:45Z
OS/Arch:        linux/amd64
Manifests:      729
Metadata files: 1

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:198ae5a1e59183511fbdcfeaf4d5c83a16716ed7734ac6cbeea4c47a32bffad6

Release Metadata:
  Version:  4.16.2
  Upgrades: 4.15.18, 4.15.19, 4.15.20, 4.15.21, 4.16.0, 4.16.1
  Metadata:
    url: https://access.redhat.com/errata/RHSA-2024:4316

Component Versions:
  kubectl          1.29.1
  kubernetes       1.29.6
  kubernetes-tests 1.29.0
  machine-os       416.94.202407030122-0 Red Hat Enterprise Linux CoreOS

The expected label for OCP 4.16.10 is

feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=416.94.202408260940-0
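
The RHCOS (machine-os) version a given OCP release ships with can be read from the release metadata, e.g. as a sketch:

# The machine-os entry is the expected OSTREE_VERSION for that release
oc adm release info 4.16.10 | grep machine-os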

tariq1890 commented 1 month ago

Since the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION node label doesn't reflect the new OCP version, this is likely an issue with Node Feature Discovery (NFD). We probably need to check the NFD deployment in the OpenShift environment.
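
A quick sketch of what to look at on the NFD side (namespace and resource names assume the default openshift-nfd deployment created by the NFD operator):

# Are the NFD operands healthy?
oc get pods -n openshift-nfd

# Check the master and worker logs for labeling errors
oc logs -n openshift-nfd deployment/nfd-master
oc logs -n openshift-nfd daemonset/nfd-worker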

kenneth-dsouza commented 1 month ago

Thanks for the update @tariq1890, let me check the NFD operator and get back to you.

ybettan commented 1 month ago

There is a mismatch between the kernel found on the node and the one present in the image, so the build falls back to an entitled build. Since my user does not have entitled builds enabled, it is bound to fail.

I don't think that is actually the case.

The packages seem to be OK in the image:

root github.com $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.10-x86_64 --image-for=driver-toolkit
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c

root github.com $ podman run -it --rm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c rpm -qa | grep kernel
kernel-headers-5.14.0-427.33.1.el9_4.x86_64
kernel-modules-core-5.14.0-427.33.1.el9_4.x86_64
kernel-core-5.14.0-427.33.1.el9_4.x86_64
kernel-modules-5.14.0-427.33.1.el9_4.x86_64
kernel-devel-5.14.0-427.33.1.el9_4.x86_64
kernel-modules-extra-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-modules-core-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-core-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-modules-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-modules-extra-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-devel-5.14.0-427.33.1.el9_4.x86_64
kernel-srpm-macros-1.0-13.el9.noarch
kernel-rpm-macros-185-13.el9.noarch

oc adm release info --image-for=driver-toolkit

I am not sure how the default version is chosen by this command, so I always specify the version I want explicitly.

How are you consuming the DTK image in the GPU operator? The easiest way will probably be to inspect the is/driver-toolkit ImageStream in the cluster, which should contain a tag for each RHCOS version present in the cluster.

root github.com $ oc get is/driver-toolkit -n openshift -o yaml | yq '.spec'
lookupPolicy:
  local: false
tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4780dd356eb8a2a4f6779bd75eed9a47072d2c495596bf9614ed13b86efebcc1
    generation: 3
    importPolicy:
      importMode: PreserveOriginal
      scheduled: true
    name: 416.94.202407030122-0
    referencePolicy:
      type: Source
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4780dd356eb8a2a4f6779bd75eed9a47072d2c495596bf9614ed13b86efebcc1
    generation: 3
    importPolicy:
      importMode: PreserveOriginal
      scheduled: true
    name: latest
    referencePolicy:
      type: Source

This IS example was taken from a 4.16.2 cluster, not a 4.16.10 one (after an upgrade you will find another tag in the list and see that the latest tag has been updated to point to a new digest).
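
To see every RHCOS-version-to-DTK-image mapping the cluster currently knows about, a small sketch over the same ImageStream:

# Print "tag -> image" pairs from the driver-toolkit ImageStream
oc -n openshift get is/driver-toolkit \
  -o jsonpath='{range .spec.tags[*]}{.name}{" -> "}{.from.name}{"\n"}{end}'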

KMM also uses this IS to make sure the correct DTK image is always used when DTK_AUTO is used in the Dockerfile; you may want to use the same approach. https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#example-dockerfile_kernel-module-management-operator

kenneth-dsouza commented 1 month ago

There is a mismatch between the kernel found on the node and the one present in the image, so the build falls back to an entitled build. Since my user does not have entitled builds enabled, it is bound to fail.

I don't think that is actually the case.

It is, as the driver toolkit image being used by the operator is

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02

which contains the 5.14.0-427.24.1.el9_4.x86_64 kernel. This is caused by the NFD operator, since the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION node label doesn't reflect the new OCP version. I am trying to fix the labels on the node via NFD and will check whether the issue still reproduces.
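
If the NFD operands are otherwise healthy, restarting them is a low-risk way to force a fresh labeling pass (a sketch; resource names assume the default NFD operator deployment in openshift-nfd):

# Force NFD to re-discover and re-apply node labels
oc -n openshift-nfd rollout restart deployment/nfd-master daemonset/nfd-worker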


ybettan commented 1 month ago

It is, as the driver toolkit image being used by the operator is

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02

which contains the 5.14.0-427.24.1.el9_4.x86_64 kernel. This is caused by the NFD operator, since the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION node label doesn't reflect the new OCP version. I am trying to fix the labels on the node via NFD and will check whether the issue still reproduces.

I see. If it helps, there is another way of finding the correct DTK image:

  1. On the node.status.nodeInfo you can find both kernelVersion: 5.14.0-427.33.1.el9_4.x86_64+rt and osImage: Red Hat Enterprise Linux CoreOS 416.94.202408260940-0 so you know that kernel 5.14.0-427.33.1.el9_4.x86_64+rt is used in RHCOS 416.94.202408260940-0.
  2. Then, in the driver-toolkit IS you can see
    ...
    - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
    generation: 4
    importPolicy:
      importMode: PreserveOriginal
      scheduled: true
    name: 416.94.202408260940-0
    referencePolicy:
      type: Source
    ...

    therefore, RHCOS 416.94.202408260940-0 is using the quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c DTK image.
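
Putting the two steps above together, a minimal sketch (the node name is hypothetical; the tag value is the RHCOS version printed in step 1):

# 1. Read the running kernel and RHCOS version straight off the node
oc get node worker-gpu-0 \
  -o jsonpath='{.status.nodeInfo.kernelVersion}{"  "}{.status.nodeInfo.osImage}{"\n"}'

# 2. Resolve the matching DTK image from the driver-toolkit ImageStream
oc -n openshift get is/driver-toolkit \
  -o jsonpath='{.spec.tags[?(@.name=="416.94.202408260940-0")].from.name}{"\n"}'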

BTW, KMM does all of that automatically by using the DTK_AUTO base image in the Dockerfile. I know the NVIDIA GPU Operator isn't using KMM, but I mention it in case it helps somehow.

kenneth-dsouza commented 1 month ago

Team, update: the issue has been resolved after fixing the NFD issue :) Once NFD updated the node with the right labels, the right DTK image is used and the builds no longer fall back to an entitled build.

cdesiniotis commented 1 month ago

@kenneth-dsouza @jayteaftw happy to hear the issue has been resolved. Can you provide details on what the issue was with NFD and how you resolved it?

kenneth-dsouza commented 1 month ago

@cdesiniotis the nfd-master pod was not coming up due to the error below:

container has runAsNonRoot and image will run as root

The wrong SCC was picked for it; once the SCC issue was resolved, the nfd-master pod came up and updated the node labels, which the NVIDIA GPU Operator picked up, and the right driver toolkit image was then referenced.
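
For anyone debugging a similar failure, a quick way to see which SCC actually admitted a pod is the openshift.io/scc annotation that OpenShift sets at admission time (a sketch; the pod name is hypothetical):

# Which SCC was the nfd-master pod admitted under?
# (pod name is hypothetical; use the actual nfd-master pod name)
oc -n openshift-nfd get pod nfd-master-xxxxx \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# The runAsNonRoot rejection also shows up in the pod events
oc -n openshift-nfd describe pod nfd-master-xxxxx | grep -i -A2 runAsNonRoot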