As there is a mismatch between the kernel found on the node and the one present in the image, it is falling back to entitled-build. As my user does not have entitled builds enabled, this is bound to fail.
$ omc logs nvidia-driver-daemonset-416.94.202407030122-0-qzlbs -c openshift-driver-toolkit-ctr
2024-09-16T18:30:13.295548997Z + '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
2024-09-16T18:30:13.295548997Z + echo Waiting for nvidia-driver-ctr container to prepare the shared directory ...
2024-09-16T18:30:13.295674042Z Waiting for nvidia-driver-ctr container to prepare the shared directory ...
2024-09-16T18:30:13.295687355Z + sleep 10
2024-09-16T18:30:23.297494455Z + '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
2024-09-16T18:30:23.297531824Z + exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
2024-09-16T18:30:23.300591588Z Running dtk-build-driver
2024-09-16T18:30:23.307484211Z WARNING: broken Driver Toolkit image detected:
2024-09-16T18:30:23.309150415Z - Node kernel: 5.14.0-427.33.1.el9_4.x86_64
2024-09-16T18:30:23.410705794Z - Kernel package: 5.14.0-427.24.1.el9_4.x86_64
2024-09-16T18:30:23.410705794Z INFO: informing nvidia-driver-ctr to fallback on entitled-build.
2024-09-16T18:30:23.412540096Z INFO: nothing else to do in openshift-driver-toolkit-ctr container, sleeping forever.
The dtk-build-driver at my end is currently using this image:
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02
But for OCP 4.16.10 it should be the image below, which has the right kernel, 5.14.0-427.33.1.el9_4.x86_64:
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
$ omc get daemonset/nvidia-driver-daemonset-416.94.202407030122-0 -o yaml | less
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02 <---- The daemonset is referring to the old image
imagePullPolicy: IfNotPresent
name: openshift-driver-toolkit-ctr
The daemonset is referring to the old image, but we do have the latest image on the cluster.
$ oc adm release info --image-for=driver-toolkit
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
Q>> How do I tell the daemon set to use the new image?
@kenneth-dsouza Thanks for sharing your findings.
Can you share the gpu-operator pod logs (the main controller pod) and the node labels? We are specifically interested in the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION label.
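For example, one quick way to print that label per node (just a sketch; any way of listing node labels works):
$ oc get nodes -L feature.node.kubernetes.io/system-os_release.OSTREE_VERSION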
Hello,
The node has the following labels:
feature.node.kubernetes.io/system-os_release.RHEL_VERSION=9.4,feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=416.94.202407030122-0,nvidia.com/gpu.deploy.dcgm-exporter=true
From the operator log I can see it is using the old driver-toolkit image.
./gpu-operator/gpu-operator/logs/current.log:2024-09-18T21:19:32.986239997Z {"level":"info","ts":"2024-09-18T21:19:32Z","logger":"controllers.ClusterPolicy","msg":"DriverToolkit","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5e6de7572003ac560f113a0082594a585c49d51801f028f699b15262eff7c02"}
Controller pod logs (I hope you meant the gpu-operator log): I am unable to upload the logs due to security reasons. Is there anything specific you want me to check?
Thanks @kenneth-dsouza !
As per the oc adm output below, the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=416.94.202407030122-0 label points to OCP version 4.16.2:
oc adm release info 4.16.2 --pullspecs
Name: 4.16.2
Digest: sha256:198ae5a1e59183511fbdcfeaf4d5c83a16716ed7734ac6cbeea4c47a32bffad6
Created: 2024-07-04T08:05:45Z
OS/Arch: linux/amd64
Manifests: 729
Metadata files: 1
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:198ae5a1e59183511fbdcfeaf4d5c83a16716ed7734ac6cbeea4c47a32bffad6
Release Metadata:
Version: 4.16.2
Upgrades: 4.15.18, 4.15.19, 4.15.20, 4.15.21, 4.16.0, 4.16.1
Metadata:
url: https://access.redhat.com/errata/RHSA-2024:4316
Component Versions:
kubectl 1.29.1
kubernetes 1.29.6
kubernetes-tests 1.29.0
machine-os 416.94.202407030122-0 Red Hat Enterprise Linux CoreOS
The expected label for OCP 4.16.10 is
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=416.94.202408260940-0
Since the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION node label doesn't reflect the new OCP version, it is likely an issue with Node Feature Discovery. We probably need to check the node feature discovery deployment in the OpenShift environment.
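For example, to get a first look at the NFD deployment (this assumes the NFD operator's usual openshift-nfd namespace; adjust if yours differs):
$ oc get pods -n openshift-nfd
$ oc get nodefeaturediscovery -n openshift-nfd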
Thanks for the update @tariq1890, let me check the NFD operator and get back to you.
As there is a mismatch between the kernel found on the node and the one present in the image, it is falling back to entitled-build. As my user does not have entitled builds enabled, this is bound to fail.
I don't think this is indeed the case.
The packages seem to be OK in the image:
root github.com $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.10-x86_64 --image-for=driver-toolkit
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
root github.com $ podman run -it --rm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c rpm -qa | grep kernel
kernel-headers-5.14.0-427.33.1.el9_4.x86_64
kernel-modules-core-5.14.0-427.33.1.el9_4.x86_64
kernel-core-5.14.0-427.33.1.el9_4.x86_64
kernel-modules-5.14.0-427.33.1.el9_4.x86_64
kernel-devel-5.14.0-427.33.1.el9_4.x86_64
kernel-modules-extra-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-modules-core-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-core-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-modules-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-modules-extra-5.14.0-427.33.1.el9_4.x86_64
kernel-rt-devel-5.14.0-427.33.1.el9_4.x86_64
kernel-srpm-macros-1.0-13.el9.noarch
kernel-rpm-macros-185-13.el9.noarch
oc adm release info --image-for=driver-toolkit
I am not sure how the default version is chosen by this command, therefore I always specify the version I want to get.
How are you consuming the DTK image in the GPU operator?
The easiest way will probably be to inspect the is/driver-toolkit imageStream in the cluster, which should contain a tag for each RHCOS version present in the cluster.
root github.com $ oc get is/driver-toolkit -n openshift -o yaml | yq '.spec'
lookupPolicy:
local: false
tags:
- annotations: null
from:
kind: DockerImage
name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4780dd356eb8a2a4f6779bd75eed9a47072d2c495596bf9614ed13b86efebcc1
generation: 3
importPolicy:
importMode: PreserveOriginal
scheduled: true
name: 416.94.202407030122-0
referencePolicy:
type: Source
- annotations: null
from:
kind: DockerImage
name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4780dd356eb8a2a4f6779bd75eed9a47072d2c495596bf9614ed13b86efebcc1
generation: 3
importPolicy:
importMode: PreserveOriginal
scheduled: true
name: latest
referencePolicy:
type: Source
This IS example was taken from a 4.16.2 cluster and not a 4.16.10 one (after an upgrade you will find another tag in the list and see that the latest tag was updated to a new digest).
KMM also uses this IS to make sure we always use the correct DTK image when DTK_AUTO is used in the Dockerfile - you may want to use the same approach (a sketch follows the doc link below).
https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#example-dockerfile_kernel-module-management-operator
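For illustration, a minimal sketch of that lookup, resolving the DTK image for a given RHCOS version from the imageStream (the tag name here is taken from the node's OSTREE_VERSION label, and the yq invocation assumes the same yq used in the command above):
$ oc get is/driver-toolkit -n openshift -o yaml | yq '.spec.tags[] | select(.name == "416.94.202407030122-0") | .from.name'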
As there is a mismatch between the kernel found on the node and the one present in the image, it is falling back to entitled-build. As my user does not have entitled builds enabled, this is bound to fail.
I don't think this is indeed the case.
It is: the driver-toolkit image being used by the operator is the old one shown above, which has kernel 5.14.0-427.24.1.el9_4.x86_64. This is caused by the NFD operator, since the feature.node.kubernetes.io/system-os_release.OSTREE_VERSION node label doesn't reflect the new OCP version. I am trying to fix the labels on the node via NFD and will see if the issue reproduces.
I see. If it helps, there is another way of finding the correct DTK image: in node.status.nodeInfo you can find both kernelVersion: 5.14.0-427.33.1.el9_4.x86_64+rt and osImage: Red Hat Enterprise Linux CoreOS 416.94.202408260940-0, so you know that kernel 5.14.0-427.33.1.el9_4.x86_64+rt is used in RHCOS 416.94.202408260940-0.
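A quick way to read both fields from the node object (node name is a placeholder):
$ oc get node <node-name> -o jsonpath='{.status.nodeInfo.kernelVersion}{"\n"}{.status.nodeInfo.osImage}{"\n"}'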
....
- annotations: null
from:
kind: DockerImage
name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c
generation: 4
importPolicy:
importMode: PreserveOriginal
scheduled: true
name: 416.94.202408260940-0
referencePolicy:
type: Source
...
Therefore, RHCOS 416.94.202408260940-0 is using the quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66d1fdd2b231a474a434a3aa603fed39137485c8f5b51d84fdd712f4b225638c DTK image.
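If you want to confirm that mapping directly, resolving the imageStream tag should also work (a sketch; the same information is visible in the IS snippet above):
$ oc get istag driver-toolkit:416.94.202408260940-0 -n openshift -o jsonpath='{.image.dockerImageReference}'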
BTW, KMM does all of that automatically by using the DTK_AUTO base image in the Dockerfile. I know that NVIDIA isn't using KMM, but I wrote this just in case it helps somehow.
Team, update: the issue has been resolved after fixing the NFD issue :) Once NFD updated the right labels, the right DTK image was used and the builds no longer fall back to the entitled build.
@kenneth-dsouza @jayteaftw happy to hear the issue has been resolved. Can you provide details on what the issue was with NFD and how you resolved it?
@cdesiniotis the nfd-master pod was not coming up due to the below error:
container has runAsNonRoot and image will run as root
The wrong SCC was picked for it. Once the SCC issue was resolved, the nfd-master came up and updated the node labels, which the NVIDIA operator picked up, and the right driver-toolkit image was referenced.
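For reference, a quick way to check which SCC a pod was admitted under is the openshift.io/scc annotation (pod name is a placeholder; the namespace assumes the default openshift-nfd install):
$ oc get pod <nfd-master-pod> -n openshift-nfd -o yaml | grep 'openshift.io/scc'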
Upgraded from OpenShift 4.16.2 to OpenShift 4.16.10. After upgrading, the nvidia-gpu-operator fails to start. Specifically, the nvidia-driver-daemonset-416.94.202407030122 is in a CrashLoopBackOff. It looks like it is a kernel problem. nvidia-driver-daemonset-416.94.202407030122-0-cfh5w-nvidia-driver-ctr.log