KodieGlosserIBM commented 8 months ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

tl;dr at the bottom

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.9
Kernel Version: 4.18.0-513.18.1.el8_9.x86_64
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): cri-o (version 1.25 and 1.26)
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP (version 4.12, 4.13)
GPU Operator Version: 23.9.2 (latest)

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

After pulling in the most recent cri-o changes on OCP 4.12/4.13 ( https://github.com/cri-o/cri-o/compare/1b1a520...8724c4d ), the GPU driver installer fails to install elfutils.

CRI-O versions:
cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8
cri-o-1.26.5-7.rhaos4.13.git692ef91.el8

Installing elfutils...
+ echo 'Installing elfutils...'
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

Use the cri-o container runtime at one of these versions: cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8 or cri-o-1.26.5-7.rhaos4.13.git692ef91.el8

4. Information to attach (optional if deemed irrelevant)

[x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE

k get pods -n gpu-operator-resources -o wide   
NAME                                       READY   STATUS             RESTARTS        AGE     IP               NODE          NOMINATED NODE   READINESS GATES
gpu-feature-discovery-9qq4g                0/1     Init:0/1           0               3h28m   172.17.162.202   10.180.8.40   <none>           <none>
gpu-operator-8b54f655-45f6k                1/1     Running            0               3h46m   172.17.162.253   10.180.8.40   <none>           <none>
nfd-controller-manager-5988c689d-ddg4q     2/2     Running            0               4h      172.17.162.248   10.180.8.40   <none>           <none>
nfd-master-966d4c54c-l7mv4                 1/1     Running            0               3h47m   172.17.162.251   10.180.8.40   <none>           <none>
nfd-worker-48tnv                           1/1     Running            0               3h47m   10.180.8.39      10.180.8.39   <none>           <none>
nfd-worker-8n856                           1/1     Running            1 (3h47m ago)   3h47m   10.180.8.40      10.180.8.40   <none>           <none>
nfd-worker-jqssq                           1/1     Running            0               3h47m   10.180.8.38      10.180.8.38   <none>           <none>
nvidia-container-toolkit-daemonset-n7k8z   0/1     Init:0/1           0               3h28m   172.17.162.230   10.180.8.40   <none>           <none>
nvidia-dcgm-exporter-hmg7m                 0/1     Init:0/1           0               3h28m   10.180.8.40      10.180.8.40   <none>           <none>
nvidia-dcgm-fnz2j                          0/1     Init:0/1           0               3h28m   10.180.8.40      10.180.8.40   <none>           <none>
nvidia-device-plugin-daemonset-zcbbl       0/1     Init:0/1           0               3h28m   172.17.162.254   10.180.8.40   <none>           <none>
nvidia-driver-daemonset-c9zfb              0/1     CrashLoopBackOff   43 (5m ago)     3h29m   172.17.162.220   10.180.8.40   <none>           <none>
nvidia-node-status-exporter-b6gxl          1/1     Running            0               3h43m   172.17.162.198   10.180.8.40   <none>           <none>
nvidia-operator-validator-zr6sb            0/1     Init:0/4           0               3h28m   172.17.162.252   10.180.8.40   <none>           <none>
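
For anyone triaging a similar cluster, the table above can be filtered down to the pods that are not fully ready. This is only a minimal sketch: `not_ready_pods` and `SAMPLE` are hypothetical names, and the parsing assumes plain `kubectl get pods` text output where the NAME, READY and STATUS columns come before any column that contains spaces.

```python
def not_ready_pods(table):
    """Return (name, ready, status) for rows whose READY column is not n/n.

    Whitespace splitting is only safe here because NAME, READY and STATUS
    appear before columns such as RESTARTS, which may contain spaces.
    """
    lines = [line for line in table.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_i = header.index("NAME")
    ready_i = header.index("READY")
    status_i = header.index("STATUS")
    result = []
    for line in lines[1:]:
        cols = line.split()
        ready = cols[ready_i]
        have, _, want = ready.partition("/")
        if have != want:
            result.append((cols[name_i], ready, cols[status_i]))
    return result


# Three rows abbreviated from the output in this report.
SAMPLE = """\
NAME                              READY   STATUS             RESTARTS      AGE
gpu-operator-8b54f655-45f6k       1/1     Running            0             3h46m
nvidia-driver-daemonset-c9zfb     0/1     CrashLoopBackOff   43 (5m ago)   3h29m
nvidia-operator-validator-zr6sb   0/1     Init:0/4           0             3h28m
"""

if __name__ == "__main__":
    for name, ready, status in not_ready_pods(SAMPLE):
        print(name, ready, status)
```

In this report the only pod that is actually crashing is the driver daemonset; the others are waiting in init behind it.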

[x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE

k get ds -n gpu-operator-resources                               
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   3h43m
nfd-worker                           3         3         3       3            3           <none>                                             3h48m
nvidia-container-toolkit-daemonset   1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       3h43m
nvidia-dcgm                          1         1         0       1            0           nvidia.com/gpu.deploy.dcgm=true                    3h43m
nvidia-dcgm-exporter                 1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           3h43m
nvidia-device-plugin-daemonset       1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           3h43m
nvidia-driver-daemonset              1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  3h43m
nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             3h43m
nvidia-node-status-exporter          1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true    3h43m
nvidia-operator-validator            1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      3h43m

[x] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

k describe pod -n gpu-operator-resources nvidia-driver-daemonset-c9zfb 
Name:                 nvidia-driver-daemonset-c9zfb
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 10.180.8.40/10.180.8.40
Start Time:           Mon, 11 Mar 2024 16:59:01 -0500
Labels:               app=nvidia-driver-daemonset
                  app.kubernetes.io/component=nvidia-driver
                  controller-revision-hash=dc74cc498
                  nvidia.com/precompiled=false
                  pod-template-generation=3
Annotations:          cni.projectcalico.org/containerID: bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6
                  cni.projectcalico.org/podIP: 172.17.162.220/32
                  cni.projectcalico.org/podIPs: 172.17.162.220/32
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "172.17.162.220"
                        ],
                        "default": true,
                        "dns": {}
                    }]
                  kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                  openshift.io/scc: nvidia-driver
Status:               Running
IP:                   172.17.162.220
IPs:
IP:           172.17.162.220
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID:  cri-o://51588c2c91637fbdaaa68b22e2b9100199a5b8e3afa0b98ea471f1acdc64a716
Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
Port:          <none>
Host Port:     <none>
Command:
  driver-manager
Args:
  uninstall_driver
State:          Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Mon, 11 Mar 2024 16:59:03 -0500
  Finished:     Mon, 11 Mar 2024 16:59:37 -0500
Ready:          True
Restart Count:  0
Environment:
  NODE_NAME:                    (v1:spec.nodeName)
  NVIDIA_VISIBLE_DEVICES:      void
  ENABLE_GPU_POD_EVICTION:     true
  ENABLE_AUTO_DRAIN:           true
  DRAIN_USE_FORCE:             false
  DRAIN_POD_SELECTOR_LABEL:    
  DRAIN_TIMEOUT_SECONDS:       0s
  DRAIN_DELETE_EMPTYDIR_DATA:  false
  OPERATOR_NAMESPACE:          gpu-operator-resources (v1:metadata.namespace)
Mounts:
  /host from host-root (ro)
  /run/nvidia from run-nvidia (rw)
  /sys from host-sys (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9xwl (ro)
Containers:
nvidia-driver-ctr:
Container ID:  cri-o://ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4
Image:         nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3
Image ID:      nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3
Port:          <none>
Host Port:     <none>
Command:
  nvidia-driver
Args:
  init
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Mon, 11 Mar 2024 20:28:37 -0500
  Finished:     Mon, 11 Mar 2024 20:28:49 -0500
Ready:          False
Restart Count:  44
Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment:    <none>
Mounts:
  /dev/log from dev-log (rw)
  /host-etc/os-release from host-os-release (ro)
  /lib/firmware from nv-firmware (rw)
  /run/mellanox/drivers from run-mellanox-drivers (rw)
  /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
  /run/nvidia from run-nvidia (rw)
  /run/nvidia-topologyd from run-nvidia-topologyd (rw)
  /sys/devices/system/memory/auto_online_blocks from sysfs-memory-online (rw)
  /sys/module/firmware_class/parameters/path from firmware-search-path (rw)
  /var/log from var-log (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9xwl (ro)
Conditions:
Type              Status
Initialized       True 
Ready             False 
ContainersReady   False 
PodScheduled      True 
Volumes:
run-nvidia:
Type:          HostPath (bare host directory volume)
Path:          /run/nvidia
HostPathType:  DirectoryOrCreate
var-log:
Type:          HostPath (bare host directory volume)
Path:          /var/log
HostPathType:  
dev-log:
Type:          HostPath (bare host directory volume)
Path:          /dev/log
HostPathType:  
host-os-release:
Type:          HostPath (bare host directory volume)
Path:          /etc/os-release
HostPathType:  
run-nvidia-topologyd:
Type:          HostPath (bare host directory volume)
Path:          /run/nvidia-topologyd
HostPathType:  DirectoryOrCreate
mlnx-ofed-usr-src:
Type:          HostPath (bare host directory volume)
Path:          /run/mellanox/drivers/usr/src
HostPathType:  DirectoryOrCreate
run-mellanox-drivers:
Type:          HostPath (bare host directory volume)
Path:          /run/mellanox/drivers
HostPathType:  DirectoryOrCreate
run-nvidia-validations:
Type:          HostPath (bare host directory volume)
Path:          /run/nvidia/validations
HostPathType:  DirectoryOrCreate
host-root:
Type:          HostPath (bare host directory volume)
Path:          /
HostPathType:  
host-sys:
Type:          HostPath (bare host directory volume)
Path:          /sys
HostPathType:  Directory
firmware-search-path:
Type:          HostPath (bare host directory volume)
Path:          /sys/module/firmware_class/parameters/path
HostPathType:  
sysfs-memory-online:
Type:          HostPath (bare host directory volume)
Path:          /sys/devices/system/memory/auto_online_blocks
HostPathType:  
nv-firmware:
Type:          HostPath (bare host directory volume)
Path:          /run/nvidia/driver/lib/firmware
HostPathType:  DirectoryOrCreate
kube-api-access-m9xwl:
Type:                    Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds:  3607
ConfigMapName:           kube-root-ca.crt
ConfigMapOptional:       <nil>
DownwardAPI:             true
ConfigMapName:           openshift-service-ca.crt
ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                         node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                         node.kubernetes.io/not-ready:NoExecute op=Exists
                         node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                         node.kubernetes.io/unreachable:NoExecute op=Exists
                         node.kubernetes.io/unschedulable:NoSchedule op=Exists
                         nvidia.com/gpu:NoSchedule op=Exists
Events:
Type     Reason   Age                    From     Message
----     ------   ----                   ----     -------
Normal   Pulled   80m (x29 over 3h29m)   kubelet  Container image "nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3" already present on machine
Warning  BackOff  29s (x947 over 3h29m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-c9zfb_gpu-operator-resources(8a0d8d4f-9f88-4c42-930d-508ab7653a98)

[x] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers


k logs -n gpu-operator-resources nvidia-driver-daemonset-c9zfb -c nvidia-driver-ctr -p
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=550.54.14
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
DRIVER_ARCH is x86_64
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ RHEL_VERSION=
+ RHEL_MAJOR_VERSION=8
+ OPEN_KERNEL_MODULES_ENABLED=false
+ [[ false == \t\r\u\e ]]
+ KERNEL_TYPE=kernel
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
++ GDRCOPY_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-513.18.1.el8_9.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ _get_rhel_version_from_kernel
+ local rhel_version_underscore rhel_version_arr
++ echo 4.18.0-513.18.1.el8_9.x86_64
++ sed 's/.*el\([0-9]\+_[0-9]\+\).*/\1/g'
+ rhel_version_underscore=8_9
+ [[ ! 8_9 =~ ^[0-9]+_[0-9]+$ ]]
+ IFS=_
+ read -r -a rhel_version_arr
+ [[ 2 -ne 2 ]]
+ RHEL_VERSION=8.9
+ echo 'RHEL VERSION successfully resolved from kernel: 8.9'
RHEL VERSION successfully resolved from kernel: 8.9
+ return 0
+ [[ -z '' ]]
+ DNF_RELEASEVER=8.9
+ return 0
+ init
+ _prepare_exclusive
+ _prepare
+ '[' passthrough = vgpu ']'
+ sh NVIDIA-Linux-x86_64-550.54.14.run -x
Creating directory NVIDIA-Linux-x86_64-550.54.14
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.14........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
+ cd NVIDIA-Linux-x86_64-550.54.14
+ sh /tmp/install.sh nvinstall
DRIVER_ARCH is x86_64

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

mkdir -p /usr/src/nvidia-550.54.14
mv LICENSE mkprecompiled kernel /usr/src/nvidia-550.54.14
sed '9,${/^(kernel|LICENSE)/!d}' .manifest

========== NVIDIA Software Installer ==========

echo -e '\n========== NVIDIA Software Installer ==========\n'
echo -e 'Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64\n'
Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64
exec
flock -n 3
echo 332725
trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
trap _shutdown EXIT
_unload_driver
rmmod_args=()
local rmmod_args
local nvidia_deps=0
local nvidia_refs=0
local nvidia_uvm_refs=0
local nvidia_modeset_refs=0
local nvidia_peermem_refs=0
echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
'[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
'[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
'[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
'[' -f /sys/module/nvidia_modeset/refcnt ']'
'[' -f /sys/module/nvidia_uvm/refcnt ']'
'[' -f /sys/module/nvidia/refcnt ']'
'[' -f /sys/module/nvidia_peermem/refcnt ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
return 0
_unmount_rootfs
Unmounting NVIDIA driver rootfs...
echo 'Unmounting NVIDIA driver rootfs...'
findmnt -r -o TARGET
grep /run/nvidia/driver
_build
_kernel_requires_package
local proc_mount_arg=
Checking NVIDIA driver packages...
echo 'Checking NVIDIA driver packages...'
[[ ! -d /usr/src/nvidia-550.54.14/kernel ]]
cd /usr/src/nvidia-550.54.14/kernel
proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-513.18.1.el8_9.x86_64/proc'
++ ls -d -1 'precompiled/**'
return 0
_update_package_cache
'[' '' '!=' builtin ']'
Updating the package cache...
echo 'Updating the package cache...'
yum -q makecache
_install_prerequisites
++ mktemp -d
local tmp_dir=/tmp/tmp.2PbAo42Ahy
trap 'rm -rf /tmp/tmp.2PbAo42Ahy' EXIT
cd /tmp/tmp.2PbAo42Ahy
echo 'Installing elfutils...'
Installing elfutils...
dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
exit 1
++ rm -rf /tmp/tmp.2PbAo42Ahy
_shutdown
_unload_driver
rmmod_args=()
local rmmod_args
local nvidia_deps=0
local nvidia_refs=0
local nvidia_uvm_refs=0
local nvidia_modeset_refs=0
local nvidia_peermem_refs=0
echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
'[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
'[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
'[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
'[' -f /sys/module/nvidia_modeset/refcnt ']'
'[' -f /sys/module/nvidia_uvm/refcnt ']'
'[' -f /sys/module/nvidia/refcnt ']'
'[' -f /sys/module/nvidia_peermem/refcnt ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
'[' 0 -gt 0 ']'
return 0
_unmount_rootfs
Unmounting NVIDIA driver rootfs...
echo 'Unmounting NVIDIA driver rootfs...'
findmnt -r -o TARGET
grep /run/nvidia/driver
rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
return 0
```
- [x] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
- [x] containerd logs `journalctl -u containerd > containerd.log`
```
Mar 11 20:28:36 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:36.897237180-05:00" level=warning msg="Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem on host \"/var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets\": write subscription data: write file: open /var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets/etc-pki-entitlement/6292044582955687386-key.pem: no such file or directory"
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.017844642-05:00" level=info msg="Created container ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4: gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr" id=845fbe19-ee47-4a2a-813f-d0bd23f6ba6c name=/runtime.v1.RuntimeService/CreateContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.018493833-05:00" level=info msg="Starting container: ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4" id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.025441319-05:00" level=info msg="Started container" PID=332725 containerID=ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4 description=gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer sandboxID=bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

tl;dr

It looks like the change is specifically the change from using the os.WriteFile function to using the umask.WriteFileIgnoreUmask function on this line: https://github.com/cri-o/cri-o/pull/7774/files#diff-23e01fcec1708a4fa51b3f495b7c7f075070b0a9c5a9195f349efee6d9444d4dR271

crio fails to mount the subscription into the container, as seen in these logs (more above):

Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem
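
The full log line above ends in `no such file or directory` while opening the key file under `etc-pki-entitlement`. One common way to get that error when creating a file is that the parent directory does not exist yet. The sketch below only reproduces that general failure mode in Python. It is not CRI-O's code and makes no claim about what the linked change actually does; `write_secret` and `demo` are hypothetical names and the file name is a placeholder.

```python
import os
import tempfile


def write_secret(path, data, create_parents):
    """Write data to path, optionally creating missing parent directories."""
    if create_parents:
        os.makedirs(os.path.dirname(path), mode=0o755, exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(data)


def demo():
    """Return (failed_without_parents, exists_after_creating_parents)."""
    with tempfile.TemporaryDirectory() as root:
        # Placeholder layout loosely modelled on the path in the log line.
        target = os.path.join(
            root, "run", "secrets", "etc-pki-entitlement", "example-key.pem"
        )
        try:
            write_secret(target, b"dummy", create_parents=False)
            failed = False
        except FileNotFoundError:
            # Same error class as "no such file or directory" in the log.
            failed = True
        write_secret(target, b"dummy", create_parents=True)
        return failed, os.path.exists(target)


if __name__ == "__main__":
    print(demo())
```

If the directory is missing at write time, the entitlement never reaches the container, which would be consistent with dnf later failing to find `elfutils-libelf-devel`.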

KodieGlosserIBM commented 8 months ago

Emailed must gather to operator_feedback@nvidia.com

Zveroloff commented 8 months ago

I think this problem is RHEL or OpenShift specific. I have K8s 1.25.5 running on CRI-O 1.25.1 (runc) on Rocky Linux 8.7, and the GPU operator runs without issues.

KodieGlosserIBM commented 8 months ago

@Zveroloff have you tried upgrading cri-o to cri-o-1.25.5-10? We only started seeing this after the latest version bump.

shivamerla commented 8 months ago

@fabiendupont can you help address this issue in CRI-O, which is causing subscription mounts to fail?

kwilczynski commented 7 months ago

Hello everyone!

The work on the CRI-O's side (via https://github.com/cri-o/cri-o/issues/7880) has been completed already.

There should be no more issues with CRI-O 1.25 and 1.26 (newer releases of CRI-O were not affected) that would prevent this operator from being run.

shivamerla commented 7 months ago

Thanks for the update @kwilczynski

kwilczynski commented 7 months ago

@francisguillier, your issue appears to be unrelated to the problem we have here.

Hopefully, you were able to resolve it.

jmkanz commented 7 months ago

@kwilczynski - I saw that the fix was backported to 4.12.54 RHSA

I've updated a cluster to this newer version and still see the issue present:

Worker Info:

cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     Ready,SchedulingDisabled   canary,worker   27h     v1.25.16+9946c63   169.60.156.4    <none>        Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa)   4.18.0-372.98.1.el8_6.x86_64   cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8

Nvidia Pods on Worker:

╰$ oc get pods -o wide -A | grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com | grep nvidia
nvidia-gpu-operator                                gpu-feature-discovery-rcdcf                                             0/1     Init:0/1                     0                  162m    10.130.2.16     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-container-toolkit-daemonset-lxpm7                                0/1     Init:0/1                     0                  162m    10.130.2.17     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-dcgm-exporter-n227d                                              0/1     Init:0/1                     0                  162m    169.60.156.4    cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-dcgm-kr6v8                                                       0/1     Init:0/1                     0                  162m    169.60.156.4    cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-device-plugin-daemonset-stgwm                                    0/1     Init:0/1                     0                  162m    10.130.2.19     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-driver-daemonset-kz5th                                           0/1     CrashLoopBackOff             294 (53s ago)      26h     10.130.2.2      cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-node-status-exporter-rcmdj                                       1/1     Running                      3                  28h     10.130.2.5      cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-operator-validator-hpkvt                                         0/1     Init:0/4                     0                  162m    10.130.2.18     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>

The output of the failing pod shows the same error as before:

+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.AIojKsyUdp
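
For context, the driver container trace in the original report derives the RHEL release from the kernel string with `sed 's/.*el\([0-9]\+_[0-9]\+\).*/\1/g'` and then uses it as `DNF_RELEASEVER`. Below is a rough Python equivalent of that one step, assuming the same pattern. `rhel_version_from_kernel` is a hypothetical name, and the rest of the script's logic is not reproduced here.

```python
import re


def rhel_version_from_kernel(kernel_version):
    """Mirror the sed expression from the trace: take the digits after 'el'.

    Returns a string such as '8.9', or None when the kernel string carries
    no 'el<major>_<minor>' component.
    """
    match = re.match(r".*el([0-9]+_[0-9]+)", kernel_version)
    if match is None:
        return None
    major, minor = match.group(1).split("_")
    return f"{major}.{minor}"


if __name__ == "__main__":
    # Kernel from the original report, then the kernel of the worker above.
    print(rhel_version_from_kernel("4.18.0-513.18.1.el8_9.x86_64"))
    print(rhel_version_from_kernel("4.18.0-372.98.1.el8_6.x86_64"))
```

If the same logic runs on this worker, its kernel (`4.18.0-372.98.1.el8_6.x86_64`) would resolve to 8.6 rather than the 8.9 seen in the original report. That may be worth keeping in mind when comparing the two environments.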

haircommander commented 7 months ago

@jmkanz a couple of things:

- A better forum may be (if possible) an OpenShift Jira ticket, as this forum is really more for upstream cri-o, and these versions are out of upstream support.
- Can you help me put together a more minimal reproducer? I attempted to install the nvidia operator, and created a clusterpolicy and nvidia driver instance, but I wonder if I did the right steps as I'm getting different failures (and I doubt the cluster I installed has GPUs to provision).
- I also tried to use a ubi8 image and I was able to install packages (elfutils was installed in ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about not having entitlement certs (Found 0 entitlement certificates), but that's a different one than you are hitting.
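
As a possible building block for a smaller reproducer, a check like the one below could be run inside any container on an affected node to see whether entitlement files were mounted at all. This is a sketch under assumptions: the in-container path `/run/secrets/etc-pki-entitlement` is inferred from the `userdata/run/secrets/etc-pki-entitlement` path in the crio log, `entitlement_summary` is a hypothetical name, and the `-key.pem` suffix convention is taken from the single file name shown in that log.

```python
from pathlib import Path


def entitlement_summary(directory):
    """List certificate and key .pem files found in an entitlement directory."""
    path = Path(directory)
    if not path.is_dir():
        return {"exists": False, "certs": [], "keys": []}
    names = sorted(entry.name for entry in path.glob("*.pem"))
    keys = [name for name in names if name.endswith("-key.pem")]
    certs = [name for name in names if not name.endswith("-key.pem")]
    return {"exists": True, "certs": certs, "keys": keys}


if __name__ == "__main__":
    # Assumed mount point inside the container; adjust if yours differs.
    print(entitlement_summary("/run/secrets/etc-pki-entitlement"))
```

An empty or missing directory in a container on the affected CRI-O build, next to a populated one on an older build, would narrow the problem to the subscription mount without needing a GPU.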

shivamerla commented 7 months ago

@jmkanz can you post the status of all pods in the cluster please (especially coredns)?

cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     Ready,SchedulingDisabled   canary,worker   27h     v1.25.16+9946c63   169.60.156.4    <none>        Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa)   4.18.0-372.98.1.el8_6.x86_64   cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8

GPU Operator does seem to cordon the node in this case, so wondering if any networking pods are being evicted, which will cause the driver install to fail.

jmkanz commented 7 months ago

@shivamerla - I manually cordoned this node when I updated it to the latest version of OpenShift 4.12.

The cordon should not impact the nvidia pods, as they run as daemonsets. I've cordoned other nodes in the cluster as well (ones on an older version of CoreOS), and they run fine with or without the cordon.

Additionally, other pods are running fine on the node. Here is the output for them (edited to sanitize IPs):

oc get pods -A -o wide |grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com
ibm-object-s3fs                                    ibmcloud-object-storage-driver-l925j                                    1/1     Running                      0                   126m    
ibm-observe                                        logdna-agent-8xssh                                                      1/1     Running                      3                   2d3h    
ibm-observe                                        sysdig-agent-n6v9s                                                      1/1     Running                      3                   2d3h    
jeg                                                kernel-image-puller-5bfpp                                               1/1     Running                      3                   2d3h    
kube-system                                        istio-cni-node-cl5sj                                                    1/1     Running                      3                   2d3h    
nvidia-gpu-operator                                gpu-feature-discovery-rcdcf                                             0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-container-toolkit-daemonset-lxpm7                                0/1     Init:0/1                     0                  
nvidia-gpu-operator                                nvidia-dcgm-exporter-n227d                                              0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-dcgm-kr6v8                                                       0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-device-plugin-daemonset-stgwm                                    0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-driver-daemonset-kz5th                                           0/1     CrashLoopBackOff             558 (14s ago)       2d1h    
nvidia-gpu-operator                                nvidia-node-status-exporter-rcmdj                                       1/1     Running                      3                   2d3h    
nvidia-gpu-operator                                nvidia-operator-validator-hpkvt                                         0/1     Init:0/4                     0                   25h    
openshift-cluster-node-tuning-operator             tuned-vnkmb                                                             1/1     Running                      1                   47h     
openshift-dns                                      dns-default-4vdm9                                                       2/2     Running                      2                   47h     
openshift-dns                                      node-resolver-l5vtp                                                     1/1     Running                      1                   47h     
openshift-image-registry                           node-ca-zmnhr                                                           1/1     Running                      1                   47h     
openshift-ingress-canary                           ingress-canary-xrdwz                                                    1/1     Running                      1                   47h     
openshift-machine-config-operator                  machine-config-daemon-bpj6x                                             2/2     Running                      2                   
openshift-monitoring                               node-exporter-djqxn                                                     2/2     Running                      2                   47h     
openshift-multus                                   multus-additional-cni-plugins-zhw2l                                     1/1     Running                      1                   47h     
openshift-multus                                   multus-ddwks                                                            1/1     Running                      1                   47h    
openshift-multus                                   network-metrics-daemon-cbj56                                            2/2     Running                      2                   47h     
openshift-network-diagnostics                      network-check-target-c6snb                                              1/1     Running                      1                   47h     
openshift-nfd                                      nfd-worker-pflkc                                                        1/1     Running                      3                   2d3h    
openshift-sdn                                      sdn-j7kng                                                               2/2     Running                      2                   47h     
openshift-storage                                  csi-cephfsplugin-fndsj                                                  2/2     Running                      6                   2d3h   
openshift-storage                                  csi-rbdplugin-5c6ml                                                     3/3     Running                      9                   2d3h    
tekton-pipelines                                   pwa-2r24x                                                               1/1     Running                      3                   2d3h
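
For a long listing like the one above, it can help to print only the pods that are not healthy. The following is a minimal sketch, not something from the original thread: `not_running` is a hypothetical helper name, and it assumes the default column order of `oc get pods -A -o wide` (namespace, name, ready, status, ...).

```shell
# Hypothetical helper: print "namespace/pod: STATUS" for every pod whose
# STATUS column is neither "Running" nor "Completed".
# Assumes the column layout of `oc get pods -A`:
#   NAMESPACE NAME READY STATUS RESTARTS AGE ...
not_running() {
  awk '$1 != "NAMESPACE" && $4 != "Running" && $4 != "Completed" {
    print $1 "/" $2 ": " $4
  }'
}

# Usage against a live cluster (NODE is a placeholder for the node name):
#   oc get pods -A -o wide --field-selector spec.nodeName="$NODE" | not_running
```

Filtering with `--field-selector spec.nodeName=...` instead of `grep` keeps the header row and avoids matching the node name in unrelated columns.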

DNS Pods for the cluster as well:

oc get pods -A | grep dns
openshift-dns-operator                             dns-operator-7f86f6f997-766l4                                           2/2     Running                      0                   47h
openshift-dns                                      dns-default-4n6hq                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-4vdm9                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-7v9rx                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-9wwps                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-bcv7s                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-bzsmp                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-csrpd                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-d677l                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-dv45x                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-j7xcv                                                       2/2     Running                      4                   47h
openshift-dns                                      dns-default-jb62l                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-lkq76                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-lpsfq                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-m6hr9                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-pf825                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-tj4bw                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-zjpsz                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-zl52j                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-zsgx8                                                       2/2     Running                      0                   47h
openshift-dns                                      node-resolver-2vsc8                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-48nkh                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-59vb8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-74btd                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-c5d4p                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-c5q44                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-clck8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-fjgnb                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-g54sd                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-gd6rk                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-gs4z2                                                     1/1     Running                      2                   47h
openshift-dns                                      node-resolver-l5n4z                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-l5vtp                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-l9kjc                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-rls8p                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-tr6wf                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-vlnj4                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-whzp6                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-wrnfs                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-xqwxs                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zhb5c                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zm4m8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zph8p                                                     1/1     Running                      0                   47h

jmkanz commented 7 months ago

@jmkanz a couple of things:

A better forum may be (if possible) an OpenShift Jira ticket, as this forum is really more for upstream CRI-O, and these versions are out of upstream support.

Can you help me put together a more minimal reproducer? I attempted to install the NVIDIA operator, and created a ClusterPolicy and NVIDIA driver instance, but I wonder if I did the right steps, as I'm getting different failures (and I doubt the cluster I installed has GPUs to provision).

I also tried to use a ubi8 image and I was able to install packages (elfutils was already installed in the ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about not having entitlement certs (Found 0 entitlement certificates), but that's a different error than the one you are hitting.

Hey @haircommander - Thanks for your reply. We can move the conversation over to the other issue in the CRI-O repo if you prefer? This is the NVIDIA one. Additionally, if your cluster doesn't have GPUs, I doubt the install will even begin, since you need the correct labels from the NFD operator for GPU-enabled workers.

I believe @KodieGlosserIBM still has the issue open in Jira with Red Hat.

haircommander commented 7 months ago

Ah, I thought I was commenting there :upside_down_face:. This is fine too if it feels right.

I'm still wondering about a more minimal reproducer, potentially without the NVIDIA operator in the picture. Or, if you could help me get access to the environment where this is failing, that would work too.

jmkanz commented 7 months ago

This seems to be resolved. I noticed this cluster had a ClusterPolicy that was not using the OCP Driver Toolkit. Using the Driver Toolkit prevents these entitlement issues.

Please see the links below for more information:

OpenShift Driver Toolkit info: https://docs.openshift.com/container-platform/4.12/hardware_enablement/psap-driver-toolkit.html

NVIDIA Docs on Installation with / without Driver Toolkit: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html
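
For anyone checking their own cluster, here is a minimal sketch of how one might test whether a ClusterPolicy enables the Driver Toolkit. It was not part of the original thread. It assumes the setting appears in the ClusterPolicy YAML as `use_ocp_driver_toolkit` under `spec.driver`, and that the policy is named `gpu-cluster-policy`; both are assumptions to verify against your installed operator version. `uses_driver_toolkit` is a hypothetical helper name.

```shell
# Hypothetical helper: reads ClusterPolicy YAML on stdin and succeeds (exit 0)
# only if it contains a line "use_ocp_driver_toolkit: true".
# Assumption: the field name matches the installed ClusterPolicy CRD.
uses_driver_toolkit() {
  grep -Eq '^[[:space:]]*use_ocp_driver_toolkit:[[:space:]]*true[[:space:]]*$'
}

# Usage against a live cluster (the policy name is an assumption):
#   if oc get clusterpolicy gpu-cluster-policy -o yaml | uses_driver_toolkit; then
#     echo "Driver Toolkit enabled"
#   else
#     echo "Driver Toolkit not enabled; driver build may need RHEL entitlements"
#   fi
```

This is a plain text match rather than a YAML parse, so it is only a quick check; inspecting the full `oc get clusterpolicy -o yaml` output is the more reliable route.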

kwilczynski commented 7 months ago

@jmkanz and @KodieGlosserIBM, thank you for the update!

Good to know that things are working fine. :tada:

kwilczynski commented 6 months ago

Hello everyone! :wave: Are we still having issues with the operator installation on CRI-O 1.25 and 1.26?

I think this problem has been resolved, and we could close this issue? Thoughts?

kwilczynski commented 4 months ago

Do we still need this issue to be open? Any more troubles? I believe we can resolve it now.

NVIDIA / gpu-operator

Latest CRI-O (on 1.25/1.26) failing to install gpu-operator #680
