Open KodieGlosserIBM opened 8 months ago
Emailed must gather to operator_feedback@nvidia.com
I think this problem is RHEL- or OpenShift-specific. I have K8s 1.25.5 running on CRI-O 1.25.1 (runc) on Rocky Linux 8.7, and the GPU Operator runs without issues.
@Zveroloff have you tried upgrading CRI-O to cri-o-1.25.5-10? This is something we just recently started seeing after this last version bump.
@fabiendupont can you help address this issue in CRI-O, which is causing subscription mounts to fail?
Hello everyone!
The work on the CRI-O side (via https://github.com/cri-o/cri-o/issues/7880) has already been completed.
There should be no more issues with CRI-O 1.25 and 1.26 (newer releases of CRI-O were not affected) that would prevent this operator from being run.
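If it helps, the CRI-O build each node is running is reported in the node status, so it can be listed cluster-wide with something like the following (a generic check; the custom-columns names here are just an example):

# Show the container runtime (CRI-O) version reported by each node.
oc get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion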
Thanks for the update @kwilczynski
@francisguillier, your issue appears to be unrelated to the problem we have here.
Hopefully, you were able to resolve it.
@kwilczynski - I saw that the fix was backported to 4.12.54 RHSA
I've updated a cluster to this newer version and still see the issue present:
Worker Info:
cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com Ready,SchedulingDisabled canary,worker 27h v1.25.16+9946c63 169.60.156.4 <none> Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa) 4.18.0-372.98.1.el8_6.x86_64 cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8
Nvidia Pods on Worker:
╰$ oc get pods -o wide -A | grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com | grep nvidia
nvidia-gpu-operator gpu-feature-discovery-rcdcf 0/1 Init:0/1 0 162m 10.130.2.16 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-container-toolkit-daemonset-lxpm7 0/1 Init:0/1 0 162m 10.130.2.17 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-dcgm-exporter-n227d 0/1 Init:0/1 0 162m 169.60.156.4 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-dcgm-kr6v8 0/1 Init:0/1 0 162m 169.60.156.4 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-device-plugin-daemonset-stgwm 0/1 Init:0/1 0 162m 10.130.2.19 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-driver-daemonset-kz5th 0/1 CrashLoopBackOff 294 (53s ago) 26h 10.130.2.2 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-node-status-exporter-rcmdj 1/1 Running 3 28h 10.130.2.5 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-operator-validator-hpkvt 0/1 Init:0/4 0 162m 10.130.2.18 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
The output of the failing pod shows the same error as before:
+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.AIojKsyUdp
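A quick way to confirm whether the entitlement certificates are actually reaching containers on that node is to look for the subscription mount inside the driver container and for the corresponding warning in the node's CRI-O journal. A rough sketch (the pod, container, and node names are the ones from the listings and logs in this issue; the path is the usual RHEL entitlement location):

# Only works while the driver container is up; lists the entitlement certs
# that CRI-O should have copied in via /usr/share/containers/mounts.conf.
oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-kz5th -c nvidia-driver-ctr -- ls -l /run/secrets/etc-pki-entitlement

# Check the node's CRI-O journal for the subscription-mount warning.
oc debug node/cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com -- chroot /host journalctl -u crio | grep -i "Failed to mount subscriptions"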
@jmkanz can you post the status of all pods in the cluster, please (especially coredns)?
cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com Ready,SchedulingDisabled canary,worker 27h v1.25.16+9946c63 169.60.156.4 <none> Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa) 4.18.0-372.98.1.el8_6.x86_64 cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8
The GPU Operator does seem to cordon the node in this case, so I am wondering if any networking pods are being evicted, which would cause the driver install to fail.
@shivamerla - I've manually cordoned this node since I updated it to the latest version of OpenShift 4.12.
The cordon should not impact the NVIDIA pods, as they run as daemonsets. I've cordoned other nodes in the cluster as well (ones on an older version of CoreOS), and they run fine with or without the cordon.
Additionally, other pods are running fine on the node. Please see the output below (edited to sanitize IPs):
oc get pods -A -o wide |grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com
ibm-object-s3fs ibmcloud-object-storage-driver-l925j 1/1 Running 0 126m
ibm-observe logdna-agent-8xssh 1/1 Running 3 2d3h
ibm-observe sysdig-agent-n6v9s 1/1 Running 3 2d3h
jeg kernel-image-puller-5bfpp 1/1 Running 3 2d3h
kube-system istio-cni-node-cl5sj 1/1 Running 3 2d3h
nvidia-gpu-operator gpu-feature-discovery-rcdcf 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-container-toolkit-daemonset-lxpm7 0/1 Init:0/1 0
nvidia-gpu-operator nvidia-dcgm-exporter-n227d 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-dcgm-kr6v8 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-device-plugin-daemonset-stgwm 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-driver-daemonset-kz5th 0/1 CrashLoopBackOff 558 (14s ago) 2d1h
nvidia-gpu-operator nvidia-node-status-exporter-rcmdj 1/1 Running 3 2d3h
nvidia-gpu-operator nvidia-operator-validator-hpkvt 0/1 Init:0/4 0 25h
openshift-cluster-node-tuning-operator tuned-vnkmb 1/1 Running 1 47h
openshift-dns dns-default-4vdm9 2/2 Running 2 47h
openshift-dns node-resolver-l5vtp 1/1 Running 1 47h
openshift-image-registry node-ca-zmnhr 1/1 Running 1 47h
openshift-ingress-canary ingress-canary-xrdwz 1/1 Running 1 47h
openshift-machine-config-operator machine-config-daemon-bpj6x 2/2 Running 2
openshift-monitoring node-exporter-djqxn 2/2 Running 2 47h
openshift-multus multus-additional-cni-plugins-zhw2l 1/1 Running 1 47h
openshift-multus multus-ddwks 1/1 Running 1 47h
openshift-multus network-metrics-daemon-cbj56 2/2 Running 2 47h
openshift-network-diagnostics network-check-target-c6snb 1/1 Running 1 47h
openshift-nfd nfd-worker-pflkc 1/1 Running 3 2d3h
openshift-sdn sdn-j7kng 2/2 Running 2 47h
openshift-storage csi-cephfsplugin-fndsj 2/2 Running 6 2d3h
openshift-storage csi-rbdplugin-5c6ml 3/3 Running 9 2d3h
tekton-pipelines pwa-2r24x 1/1 Running 3 2d3h
DNS Pods for the cluster as well:
╰$ oc get pods -A |grep dns
openshift-dns-operator dns-operator-7f86f6f997-766l4 2/2 Running 0 47h
openshift-dns dns-default-4n6hq 2/2 Running 0 47h
openshift-dns dns-default-4vdm9 2/2 Running 2 47h
openshift-dns dns-default-7v9rx 2/2 Running 0 47h
openshift-dns dns-default-9wwps 2/2 Running 0 47h
openshift-dns dns-default-bcv7s 2/2 Running 2 47h
openshift-dns dns-default-bzsmp 2/2 Running 0 47h
openshift-dns dns-default-csrpd 2/2 Running 0 47h
openshift-dns dns-default-d677l 2/2 Running 0 47h
openshift-dns dns-default-dv45x 2/2 Running 0 47h
openshift-dns dns-default-j7xcv 2/2 Running 4 47h
openshift-dns dns-default-jb62l 2/2 Running 0 47h
openshift-dns dns-default-lkq76 2/2 Running 2 47h
openshift-dns dns-default-lpsfq 2/2 Running 0 47h
openshift-dns dns-default-m6hr9 2/2 Running 0 47h
openshift-dns dns-default-pf825 2/2 Running 0 47h
openshift-dns dns-default-tj4bw 2/2 Running 0 47h
openshift-dns dns-default-zjpsz 2/2 Running 2 47h
openshift-dns dns-default-zl52j 2/2 Running 0 47h
openshift-dns dns-default-zsgx8 2/2 Running 0 47h
openshift-dns node-resolver-2vsc8 1/1 Running 1 47h
openshift-dns node-resolver-48nkh 1/1 Running 0 47h
openshift-dns node-resolver-59vb8 1/1 Running 0 47h
openshift-dns node-resolver-74btd 1/1 Running 0 47h
openshift-dns node-resolver-c5d4p 1/1 Running 0 47h
openshift-dns node-resolver-c5q44 1/1 Running 0 47h
openshift-dns node-resolver-clck8 1/1 Running 0 47h
openshift-dns node-resolver-fjgnb 1/1 Running 1 47h
openshift-dns node-resolver-g54sd 1/1 Running 0 47h
openshift-dns node-resolver-gd6rk 1/1 Running 0 47h
openshift-dns node-resolver-gs4z2 1/1 Running 2 47h
openshift-dns node-resolver-l5n4z 1/1 Running 0 47h
openshift-dns node-resolver-l5vtp 1/1 Running 1 47h
openshift-dns node-resolver-l9kjc 1/1 Running 0 47h
openshift-dns node-resolver-rls8p 1/1 Running 1 47h
openshift-dns node-resolver-tr6wf 1/1 Running 0 47h
openshift-dns node-resolver-vlnj4 1/1 Running 0 47h
openshift-dns node-resolver-whzp6 1/1 Running 0 47h
openshift-dns node-resolver-wrnfs 1/1 Running 0 47h
openshift-dns node-resolver-xqwxs 1/1 Running 0 47h
openshift-dns node-resolver-zhb5c 1/1 Running 0 47h
openshift-dns node-resolver-zm4m8 1/1 Running 0 47h
openshift-dns node-resolver-zph8p 1/1 Running 0 47h
@jmkanz a couple of things:
- A better forum may be (if possible) an OpenShift Jira ticket, as this forum is really more for upstream CRI-O, and these versions are out of upstream support.
- Can you help me put together a more minimal reproducer? I attempted to install the NVIDIA operator and created a ClusterPolicy and NVIDIA driver instance, but I wonder if I did the right steps, as I'm getting different failures (and I doubt the cluster I installed on has GPUs to provision).
- I also tried a ubi8 image and was able to install packages (elfutils was already in the ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about not having entitlement certs (Found 0 entitlement certificates), but that's a different issue than the one you are hitting.
Hey @haircommander - thanks for your reply. We can move the conversation over to the other issue in the CRI-O repo if you prefer; this is the NVIDIA one. Additionally, if your cluster doesn't have GPUs, I doubt the install will even begin, since you need the correct labels from the NFD operator for GPU-enabled workers.
I believe @KodieGlosserIBM still has the issue open in Jira with Red Hat.
Ah, I thought I was commenting there :upside_down_face:. This is fine too if it feels right.
I'm still wondering about a more minimal reproducer, potentially without the NVIDIA operator in the picture. Or, if you could help me get access to the environment where this is failing, that would work too.
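One possible shape for such a reproducer, without the GPU operator in the picture, is just pinning a plain UBI8 pod to the affected node and attempting the same package install the driver container performs. A sketch only; the image, node name, and package come from earlier in this thread:

# Run a bare UBI8 pod directly on the affected node (nodeName bypasses the
# scheduler, so the cordon does not matter) and try the failing dnf install.
oc run entitlement-test --image=registry.access.redhat.com/ubi8/ubi:latest --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com"}}' \
  -- dnf install -y elfutils-libelf-devel.x86_64

# On an affected CRI-O build the subscription mount is skipped, so this should
# fail with "Unable to find a match: elfutils-libelf-devel.x86_64".
oc logs -f pod/entitlement-test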
This seems to be resolved. I noticed this cluster had a ClusterPolicy that was not using the OCP Driver Toolkit, which is what avoids these entitlement issues.
Please see the links below for more information:
OpenShift Driver Toolkit info: https://docs.openshift.com/container-platform/4.12/hardware_enablement/psap-driver-toolkit.html
NVIDIA docs on installation with/without the Driver Toolkit: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html
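For anyone hitting the same thing, a quick sanity check is to look at the driver section of the ClusterPolicy and at the containers in the driver daemonset pods. A rough sketch; the ClusterPolicy name, the label selector, and the openshift-driver-toolkit-ctr container name are the defaults I'd expect from the docs above and may differ by operator version:

# Inspect how the driver is configured in the ClusterPolicy (field names may
# vary between GPU operator versions; see the NVIDIA docs linked above).
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver}{"\n"}'

# When the Driver Toolkit is in use, the driver daemonset pods should include
# an openshift-driver-toolkit-ctr container alongside nvidia-driver-ctr.
oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'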
@jmkanz and @KodieGlosserIBM, thank you for the update!
Good to know that things are working fine. :tada:
Hello everyone! :wave: Are we still having issues with the operator installation on CRI-O 1.25 and 1.26?
I think this problem has been resolved; could we close this issue? Thoughts?
Do we still need this issue to be open? Any more troubles? I believe we can resolve it now.
tl;dr at the bottom
1. Quick Debug Information
Kernel version: 4.18.0-513.18.1.el8_9.x86_64
2. Issue or feature description
After pulling in the most recent CRI-O changes on OCP 4.12/4.13 (https://github.com/cri-o/cri-o/compare/1b1a520...8724c4d), i.e. CRI-O versions:
cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8
cri-o-1.26.5-7.rhaos4.13.git692ef91.el8
the GPU driver installer fails to install elfutils.
3. Steps to reproduce the issue
Use the CRI-O container runtime on version(s):
cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8
cri-o-1.26.5-7.rhaos4.13.git692ef91.el8
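To confirm which CRI-O build a given node is actually running, the installed RPM can be queried from a debug pod (a generic check; substitute your node name):

# Query the installed cri-o RPM on the node's host filesystem.
oc debug node/<node-name> -- chroot /host rpm -q cri-o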
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
========== NVIDIA Software Installer ==========
echo -e '\n========== NVIDIA Software Installer ==========\n'
echo -e 'Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64\n' Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64
exec
flock -n 3
echo 332725
trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
trap _shutdown EXIT
_unload_driver
rmmod_args=()
local rmmod_args
local nvidia_deps=0
local nvidia_refs=0
local nvidia_uvm_refs=0
local nvidia_modeset_refs=0
local nvidia_peermem_refs=0
echo 'Stopping NVIDIA persistence daemon...' Stopping NVIDIA persistence daemon...
'[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
'[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
'[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
echo 'Unloading NVIDIA driver kernel modules...' Unloading NVIDIA driver kernel modules...
'[' -f /sys/module/nvidia_modeset/refcnt ']'
'[' -f /sys/module/nvidia_uvm/refcnt ']'
'[' -f /sys/module/nvidia/refcnt ']'
'[' -f /sys/module/nvidia_peermem/refcnt ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
return 0
_unmount_rootfs Unmounting NVIDIA driver rootfs...
echo 'Unmounting NVIDIA driver rootfs...'
findmnt -r -o TARGET
grep /run/nvidia/driver
_build
_kernel_requires_package
local proc_mount_arg= Checking NVIDIA driver packages...
echo 'Checking NVIDIA driver packages...'
[[ ! -d /usr/src/nvidia-550.54.14/kernel ]]
cd /usr/src/nvidia-550.54.14/kernel
proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-513.18.1.el8_9.x86_64/proc' ++ ls -d -1 'precompiled/**'
return 0
_update_package_cache
'[' '' '!=' builtin ']' Updating the package cache...
echo 'Updating the package cache...'
yum -q makecache
_install_prerequisites ++ mktemp -d
local tmp_dir=/tmp/tmp.2PbAo42Ahy
trap 'rm -rf /tmp/tmp.2PbAo42Ahy' EXIT
cd /tmp/tmp.2PbAo42Ahy
echo 'Installing elfutils...'
Installing elfutils...
dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
exit 1
++ rm -rf /tmp/tmp.2PbAo42Ahy
_shutdown
_unload_driver
rmmod_args=()
local rmmod_args
local nvidia_deps=0
local nvidia_refs=0
local nvidia_uvm_refs=0
local nvidia_modeset_refs=0
local nvidia_peermem_refs=0
echo 'Stopping NVIDIA persistence daemon...' Stopping NVIDIA persistence daemon...
'[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
'[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
'[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
echo 'Unloading NVIDIA driver kernel modules...' Unloading NVIDIA driver kernel modules...
'[' -f /sys/module/nvidia_modeset/refcnt ']'
'[' -f /sys/module/nvidia_uvm/refcnt ']'
'[' -f /sys/module/nvidia/refcnt ']'
'[' -f /sys/module/nvidia_peermem/refcnt ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
return 0
_unmount_rootfs Unmounting NVIDIA driver rootfs...
echo 'Unmounting NVIDIA driver rootfs...'
findmnt -r -o TARGET
grep /run/nvidia/driver
rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
return 0
Mar 11 20:28:36 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:36.897237180-05:00" level=warning msg="Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem on host \"/var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets\": write subscription data: write file: open /var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets/etc-pki-entitlement/6292044582955687386-key.pem: no such file or directory"
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.017844642-05:00" level=info msg="Created container ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4: gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr" id=845fbe19-ee47-4a2a-813f-d0bd23f6ba6c name=/runtime.v1.RuntimeService/CreateContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.018493833-05:00" level=info msg="Starting container: ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4" id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.025441319-05:00" level=info msg="Started container" PID=332725 containerID=ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4 description=gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer sandboxID=bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com
tl;dr
It looks like the culprit is specifically the change from using the os.WriteFile function to using the umask.WriteFileIgnoreUmask function on this line: https://github.com/cri-o/cri-o/pull/7774/files#diff-23e01fcec1708a4fa51b3f495b7c7f075070b0a9c5a9195f349efee6d9444d4dR271
CRI-O fails to mount the subscription into the container, as seen in the logs above.