NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Latest CRI-O (on 1.25/1.26) failing to install gpu-operator #680

Open KodieGlosserIBM opened 8 months ago

KodieGlosserIBM commented 8 months ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

tl;dr at the bottom

1. Quick Debug Information

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

The failure appeared after pulling in the most recent CRI-O changes on OCP 4.12/4.13: https://github.com/cri-o/cri-o/compare/1b1a520...8724c4d

CRI-O versions:

cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8
cri-o-1.26.5-7.rhaos4.13.git692ef91.el8

The GPU installer is failing to install elfutils:

Installing elfutils...
+ echo 'Installing elfutils...'
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

Use the CRI-O container runtime at version cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8 or cri-o-1.26.5-7.rhaos4.13.git692ef91.el8.

4. Information to attach (optional if deemed irrelevant)

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

========== NVIDIA Software Installer ==========

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

tl;dr

It looks like the culprit is specifically the switch from the os.WriteFile function to the umask.WriteFileIgnoreUmask function on this line: https://github.com/cri-o/cri-o/pull/7774/files#diff-23e01fcec1708a4fa51b3f495b7c7f075070b0a9c5a9195f349efee6d9444d4dR271
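For readers unfamiliar with why a umask-aware write differs from a plain one, here is a minimal shell sketch of the general idea. It is not CRI-O's code, and it does not establish how the linked change broke subscription mounts; it only shows that a file created under a restrictive umask ends up with narrower permissions unless the mode is set explicitly afterwards. The file names are hypothetical, and `stat -c` assumes GNU coreutils (Linux).

```shell
#!/bin/sh
# Sketch: how the process umask changes the mode of a newly written file,
# and how setting the mode explicitly afterwards sidesteps it.
workdir=$(mktemp -d)

# Plain write under a restrictive umask: the default 0666 is masked to 0600.
(umask 077; printf 'data\n' > "$workdir/masked")
masked_mode=$(stat -c '%a' "$workdir/masked")

# Same write followed by an explicit chmod: the final mode no longer
# depends on the umask in effect.
(umask 077; printf 'data\n' > "$workdir/explicit"; chmod 0644 "$workdir/explicit")
explicit_mode=$(stat -c '%a' "$workdir/explicit")

echo "masked=$masked_mode explicit=$explicit_mode"
rm -rf "$workdir"
```

A difference of this kind between the two write paths is one place to look when a file written by the runtime becomes unreadable or unwritable for a later step.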

CRI-O fails to mount the subscription into the container, as seen in these logs (more above):


Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem

KodieGlosserIBM commented 8 months ago

Emailed must gather to operator_feedback@nvidia.com

Zveroloff commented 8 months ago

I think this problem is RHEL or OpenShift specific. I have K8s 1.25.5 running on CRI-O 1.25.1 (runc) on Rocky Linux 8.7, and the GPU operator runs without issues.

KodieGlosserIBM commented 8 months ago

@Zveroloff have you tried upgrading CRI-O to cri-o-1.25.5-10? This is something we only started seeing after this last version bump.

shivamerla commented 7 months ago

@fabiendupont can you help address this issue in CRI-O, which is causing subscription mounts to fail?

kwilczynski commented 7 months ago

Hello everyone!

The work on the CRI-O side (via https://github.com/cri-o/cri-o/issues/7880) has already been completed.

There should be no more issues with CRI-O 1.25 and 1.26 (newer releases of CRI-O were not affected) that would prevent this operator from being run.

shivamerla commented 7 months ago

Thanks for the update @kwilczynski

kwilczynski commented 7 months ago

@francisguillier, your issue appears to be unrelated to the problem we have here.

Hopefully, you were able to resolve it.

jmkanz commented 7 months ago

@kwilczynski - I saw that the fix was backported to 4.12.54 RHSA

I've updated a cluster to this newer version and still see the issue present:

Worker Info:

cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     Ready,SchedulingDisabled   canary,worker   27h     v1.25.16+9946c63   169.60.156.4    <none>        Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa)   4.18.0-372.98.1.el8_6.x86_64   cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8

Nvidia Pods on Worker:

╰$ oc get pods -o wide -A | grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com | grep nvidia
nvidia-gpu-operator                                gpu-feature-discovery-rcdcf                                             0/1     Init:0/1                     0                  162m    10.130.2.16     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-container-toolkit-daemonset-lxpm7                                0/1     Init:0/1                     0                  162m    10.130.2.17     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-dcgm-exporter-n227d                                              0/1     Init:0/1                     0                  162m    169.60.156.4    cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-dcgm-kr6v8                                                       0/1     Init:0/1                     0                  162m    169.60.156.4    cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-device-plugin-daemonset-stgwm                                    0/1     Init:0/1                     0                  162m    10.130.2.19     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-driver-daemonset-kz5th                                           0/1     CrashLoopBackOff             294 (53s ago)      26h     10.130.2.2      cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-node-status-exporter-rcmdj                                       1/1     Running                      3                  28h     10.130.2.5      cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-operator-validator-hpkvt                                         0/1     Init:0/4                     0                  162m    10.130.2.18     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
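When a node carries many pods, it can help to reduce a listing like the one above to the pods that are not fully ready. The following is a small sketch, assuming the default `oc get pods -A` column order (NAMESPACE, NAME, READY, STATUS, ...); the `not_ready_pods` helper name is hypothetical, and the sample rows are abbreviated from the output above.

```shell
#!/bin/sh
# Sketch: print "namespace/name status" for every pod whose READY count is
# incomplete or whose STATUS is not Running. Reads `oc get pods -A` rows on stdin.
not_ready_pods() {
  awk '
    $1 == "NAMESPACE" { next }   # skip the header row if present
    {
      split($3, ready, "/")      # READY column looks like "0/1"
      if (ready[1] != ready[2] || $4 != "Running")
        print $1 "/" $2 " " $4
    }'
}

# Sample rows abbreviated from the listing above.
sample='nvidia-gpu-operator gpu-feature-discovery-rcdcf 0/1 Init:0/1 0 162m
nvidia-gpu-operator nvidia-driver-daemonset-kz5th 0/1 CrashLoopBackOff 294 (53s ago) 26h
nvidia-gpu-operator nvidia-node-status-exporter-rcmdj 1/1 Running 3 28h'

result=$(printf '%s\n' "$sample" | not_ready_pods)
printf '%s\n' "$result"
```

Against a live cluster the same filter could be fed with `oc get pods -A -o wide | grep <node-name> | not_ready_pods`. Note that this simple check would also flag pods in the Completed state.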

The output of the failing pod shows the same error as before:

+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.AIojKsyUdp

haircommander commented 7 months ago

@jmkanz a couple of things:

shivamerla commented 7 months ago

@jmkanz can you post the status of all pods in the cluster, please (especially coredns)?

cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     Ready,SchedulingDisabled   canary,worker   27h     v1.25.16+9946c63   169.60.156.4    <none>        Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa)   4.18.0-372.98.1.el8_6.x86_64   cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8

The GPU Operator does seem to cordon the node in this case, so I'm wondering whether any networking pods are being evicted, which would cause the driver install to fail.

jmkanz commented 7 months ago

@shivamerla - I manually cordoned this node after updating it to the latest version of OpenShift 4.12.

The cordon should not impact the NVIDIA pods, as they run as daemonsets. I've cordoned other nodes in the cluster as well (ones on an older version of CoreOS), and they run fine with or without the cordon.

Additionally, other pods are running fine on the node. Here is the output for them (edited to sanitize IPs):

oc get pods -A -o wide |grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com
ibm-object-s3fs                                    ibmcloud-object-storage-driver-l925j                                    1/1     Running                      0                   126m    
ibm-observe                                        logdna-agent-8xssh                                                      1/1     Running                      3                   2d3h    
ibm-observe                                        sysdig-agent-n6v9s                                                      1/1     Running                      3                   2d3h    
jeg                                                kernel-image-puller-5bfpp                                               1/1     Running                      3                   2d3h    
kube-system                                        istio-cni-node-cl5sj                                                    1/1     Running                      3                   2d3h    
nvidia-gpu-operator                                gpu-feature-discovery-rcdcf                                             0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-container-toolkit-daemonset-lxpm7                                0/1     Init:0/1                     0                  
nvidia-gpu-operator                                nvidia-dcgm-exporter-n227d                                              0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-dcgm-kr6v8                                                       0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-device-plugin-daemonset-stgwm                                    0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-driver-daemonset-kz5th                                           0/1     CrashLoopBackOff             558 (14s ago)       2d1h    
nvidia-gpu-operator                                nvidia-node-status-exporter-rcmdj                                       1/1     Running                      3                   2d3h    
nvidia-gpu-operator                                nvidia-operator-validator-hpkvt                                         0/1     Init:0/4                     0                   25h    
openshift-cluster-node-tuning-operator             tuned-vnkmb                                                             1/1     Running                      1                   47h     
openshift-dns                                      dns-default-4vdm9                                                       2/2     Running                      2                   47h     
openshift-dns                                      node-resolver-l5vtp                                                     1/1     Running                      1                   47h     
openshift-image-registry                           node-ca-zmnhr                                                           1/1     Running                      1                   47h     
openshift-ingress-canary                           ingress-canary-xrdwz                                                    1/1     Running                      1                   47h     
openshift-machine-config-operator                  machine-config-daemon-bpj6x                                             2/2     Running                      2                   
openshift-monitoring                               node-exporter-djqxn                                                     2/2     Running                      2                   47h     
openshift-multus                                   multus-additional-cni-plugins-zhw2l                                     1/1     Running                      1                   47h     
openshift-multus                                   multus-ddwks                                                            1/1     Running                      1                   47h    
openshift-multus                                   network-metrics-daemon-cbj56                                            2/2     Running                      2                   47h     
openshift-network-diagnostics                      network-check-target-c6snb                                              1/1     Running                      1                   47h     
openshift-nfd                                      nfd-worker-pflkc                                                        1/1     Running                      3                   2d3h    
openshift-sdn                                      sdn-j7kng                                                               2/2     Running                      2                   47h     
openshift-storage                                  csi-cephfsplugin-fndsj                                                  2/2     Running                      6                   2d3h   
openshift-storage                                  csi-rbdplugin-5c6ml                                                     3/3     Running                      9                   2d3h    
tekton-pipelines                                   pwa-2r24x                                                               1/1     Running                      3                   2d3h    

DNS Pods for the cluster as well:

╰$ oc get pods -A |grep dns
openshift-dns-operator                             dns-operator-7f86f6f997-766l4                                           2/2     Running                      0                   47h
openshift-dns                                      dns-default-4n6hq                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-4vdm9                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-7v9rx                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-9wwps                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-bcv7s                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-bzsmp                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-csrpd                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-d677l                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-dv45x                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-j7xcv                                                       2/2     Running                      4                   47h
openshift-dns                                      dns-default-jb62l                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-lkq76                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-lpsfq                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-m6hr9                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-pf825                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-tj4bw                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-zjpsz                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-zl52j                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-zsgx8                                                       2/2     Running                      0                   47h
openshift-dns                                      node-resolver-2vsc8                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-48nkh                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-59vb8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-74btd                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-c5d4p                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-c5q44                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-clck8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-fjgnb                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-g54sd                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-gd6rk                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-gs4z2                                                     1/1     Running                      2                   47h
openshift-dns                                      node-resolver-l5n4z                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-l5vtp                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-l9kjc                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-rls8p                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-tr6wf                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-vlnj4                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-whzp6                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-wrnfs                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-xqwxs                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zhb5c                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zm4m8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zph8p                                                     1/1     Running                      0                   47h
jmkanz commented 7 months ago

> @jmkanz a couple of things:
>
>   • a better forum may be (if possible) an openshift Jira ticket, as this forum is really more for upstream cri-o, and these versions are out of upstream support
>   • can you help me put together a more minimal reproducer? I attempted to install the nvidia operator, and created a clusterpolicy and nvidia driver instance, but I wonder if I did the right steps as I'm getting different failures (and I doubt the cluster I installed has GPUs to provision)
>     • I also tried to use a ubi8 image and I was able to install packages (elfutils was installed in ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about not having entitlement certs (Found 0 entitlement certificates), but that's a different one than you are hitting

Hey @haircommander - Thanks for your reply. We can move the conversation over to the other issue in the CRI-O repo if you prefer; this is the NVIDIA one. Additionally, if your cluster doesn't have GPUs, I doubt the install will even begin, since you need the correct labels from the NFD operator for GPU-enabled workers.

I believe @KodieGlosserIBM still has the issue open in Jira with Red Hat.

haircommander commented 7 months ago

ah I thought I was commenting there :upside_down_face: . this is fine too if this feels right.

Still wondering about a more minimal reproducer, potentially without nvidia operator in the picture. Or, if you could help me get access to the environment with this failing, that would work too

jmkanz commented 7 months ago

This seems to be resolved. I noticed this cluster had a ClusterPolicy that was not using the OCP Driver Toolkit. Using the Driver Toolkit prevents these entitlement issues.

Please see links below for more information

OpenShift Driver Toolkit Info: https://docs.openshift.com/container-platform/4.12/hardware_enablement/psap-driver-toolkit.html

NVIDIA Docs on Installation with / without Driver Toolkit: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html
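For anyone wanting to check the same setting on their own cluster, below is a rough sketch that inspects ClusterPolicy YAML for the Driver Toolkit flag. The field name `use_ocp_driver_toolkit` under `spec.driver` is an assumption based on my recollection of the ClusterPolicy spec and should be verified against the NVIDIA docs linked above; the `uses_driver_toolkit` helper and the sample YAML are hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the ClusterPolicy YAML on stdin enables the Driver Toolkit.
# ASSUMPTION: the flag is spelled "use_ocp_driver_toolkit" in the ClusterPolicy spec.
uses_driver_toolkit() {
  grep -Eq '^[[:space:]]*use_ocp_driver_toolkit:[[:space:]]*true[[:space:]]*$'
}

# Hypothetical ClusterPolicy fragment, standing in for real cluster output.
sample_policy='spec:
  driver:
    enabled: true
    use_ocp_driver_toolkit: false'

if printf '%s\n' "$sample_policy" | uses_driver_toolkit; then
  toolkit_status=enabled
else
  toolkit_status=disabled
fi
echo "Driver Toolkit: $toolkit_status"
```

On a live cluster the input would come from something like `oc get clusterpolicy -o yaml | uses_driver_toolkit`. A text match is used here only to keep the sketch dependency-free; a YAML-aware query would be more robust.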

kwilczynski commented 7 months ago

@jmkanz and @KodieGlosserIBM, thank you for the update!

Good to know that things are working fine. :tada:

kwilczynski commented 6 months ago

Hello everyone! :wave: Are we still having issues with the operator installation on CRI-O 1.25 and 1.26?

I think this problem has been resolved, and we could close this issue? Thoughts?

kwilczynski commented 3 months ago

Do we still need this issue to be open? Any more troubles? I believe we can resolve it now.