Mellanox / network-operator

Mellanox Network Operator
Apache License 2.0

MOFED Pod CrashLoopBackOff State #830

Open ruta-04 opened 7 months ago

ruta-04 commented 7 months ago

I am using NVIDIA Network Operator 23.10.0 on OpenShift Container Platform (RHCOS 4.14).

When I create a NicClusterPolicy, it spins up MOFED pods that crash continuously (they show as running but stay in a not-ready state).

` [ansible@csah-pri entitlement]$ oc logs mofed-rhcos4.14-ds-ncj2n
Unsetting driver ready state
No OFED driver found for kernel 5.14.0-284.40.1.el9_2.x86_64
Enabling RHOCP and EUS RPM repos...
ID="rhcos" VERSION_ID="4.14" RHEL_VERSION="9.2"
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
cuda 589 B/s | 3.5 kB 00:06
cuda 294 kB/s | 1.2 MB 00:04
Red Hat Enterprise Linux 9 for x86_64 - AppStre 0.0 B/s | 0 B 00:10
Errors during downloading metadata for repository 'rhel-9-for-x86_64-appstream-rpms':

Installed: cryptsetup-libs-2.6.0-3.el9.x86_64
device-mapper-9:1.02.195-3.el9.x86_64
device-mapper-libs-9:1.02.195-3.el9.x86_64
dracut-057-44.git20230822.el9.x86_64
kbd-2.4.0-9.el9.x86_64
kbd-legacy-2.4.0-9.el9.noarch
kbd-misc-2.4.0-9.el9.noarch
kernel-5.14.0-284.40.1.el9_2.x86_64
kernel-core-5.14.0-284.40.1.el9_2.x86_64
kernel-modules-5.14.0-284.40.1.el9_2.x86_64
kernel-modules-core-5.14.0-284.40.1.el9_2.x86_64
kpartx-0.8.7-22.el9.x86_64
libkcapi-1.3.1-3.el9.x86_64
libkcapi-hmaccalc-1.3.1-3.el9.x86_64
linux-firmware-20230310-138.el9_2.noarch
linux-firmware-whence-20230310-138.el9_2.noarch
pigz-2.5-4.el9.x86_64
systemd-udev-252-18.el9.x86_64

Downgraded: elfutils-0.188-3.el9.x86_64
elfutils-debuginfod-client-0.188-3.el9.x86_64
elfutils-libelf-0.188-3.el9.x86_64
elfutils-libs-0.188-3.el9.x86_64
Installed: createrepo_c-0.20.1-1.el9.x86_64
createrepo_c-libs-0.20.1-1.el9.x86_64
elfutils-libelf-devel-0.188-3.el9.x86_64
kernel-rpm-macros-185-12.el9.noarch
numactl-libs-2.0.16-1.el9.x86_64
zlib-devel-1.2.11-40.el9.x86_64
Installing Linux kernel headers...
Error: Unable to find a match: kernel-headers-5.14.0-284.40.1.el9_2.x86_64 kernel-devel-5.14.0-284.40.1.el9_2.x86_64

Command "dnf -q -y --releasever=9.2 install kernel-headers-5.14.0-284.40.1.el9_2.x86_64 kernel-devel-5.14.0-284.40.1.el9_2.x86_64" failed with exit code: 1
Terminate event caught
Terminating container
Unsetting driver ready state
Keeping currently loaded Mellanox OFED Driver...`

I can't pinpoint the issue here. Is it that the OS kernel version is not currently supported by the OFED driver?
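To make the failing step in the log above easier to spot, here is a minimal Python sketch that pulls the package names out of the dnf "Unable to find a match" line. The function name `missing_packages` and the `LOG` constant are hypothetical helpers introduced for illustration; the sample text is copied from the log in this comment.

```python
# Two lines copied from the MOFED pod log quoted above.
LOG = """\
Installing Linux kernel headers...
Error: Unable to find a match: kernel-headers-5.14.0-284.40.1.el9_2.x86_64 kernel-devel-5.14.0-284.40.1.el9_2.x86_64
"""


def missing_packages(log_text: str) -> list[str]:
    """Return the package names listed after dnf's 'Unable to find a match:' error.

    Hypothetical helper: it only does plain string matching on the log text.
    """
    prefix = "Error: Unable to find a match:"
    missing = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Everything after the prefix is a space-separated list of packages.
            missing.extend(line[len(prefix):].split())
    return missing


if __name__ == "__main__":
    for pkg in missing_packages(LOG):
        print(pkg)
```

For this log it reports the `kernel-headers` and `kernel-devel` packages for kernel 5.14.0-284.40.1.el9_2, which are the two packages the container could not find in the enabled repositories.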

rollandf commented 6 months ago

Hi @ruta-04, did you configure cluster-wide-entitlement? See https://docs.nvidia.com/networking/display/kubernetes2310/network+operator#src-144713486_NetworkOperator-Cluster-wideEntitlement

ruta-04 commented 6 months ago

@rollandf

Yes, I added the cluster-wide entitlement and ran the test pod provided in the instructions, which gave the following output. It matches the example output.

[ansible@csah-pri entitlement]$ oc logs cluster-entitled-build-pod -n default
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
Red Hat Enterprise Linux 9 for x86_64 - BaseOS 14 MB/s | 17 MB 00:01
Red Hat Enterprise Linux 9 for x86_64 - AppStre 25 MB/s | 29 MB 00:01
Red Hat Universal Base Image 9 (RPMs) - BaseOS 481 kB/s | 515 kB 00:01
Red Hat Universal Base Image 9 (RPMs) - AppStre 2.2 MB/s | 1.8 MB 00:00
Red Hat Universal Base Image 9 (RPMs) - CodeRea 321 kB/s | 192 kB 00:00
====================== Name Exactly Matched: kernel-devel ======================
kernel-devel-5.14.0-70.13.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.17.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.22.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.26.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.30.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.6.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.23.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.12.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.22.2.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.18.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.11.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.25.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.18.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.30.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.8.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.13.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.18.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
========================== Name Matched: kernel-devel ==========================
kernel-devel-matched-5.14.0-70.13.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.17.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.26.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.22.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.30.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.6.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.23.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.22.2.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.12.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.18.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.11.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.25.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.18.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-362.8.1.el9_3.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.30.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-362.13.1.el9_3.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-362.18.1.el9_3.x86_64 : Meta package to install matching core and devel packages for a given kernel
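One thing worth checking in the listing above is whether it contains a `kernel-devel` entry for the node's running kernel. The listing shows 5.14.0-284.30.1.el9_2 and the 362.x builds, but no entry for 5.14.0-284.40.1.el9_2, the kernel named in the MOFED log, which would be consistent with the "Unable to find a match" error. A minimal Python sketch of that check follows; `available_kernel_devel` and `has_kernel_devel` are hypothetical helper names, and the sample lines are a subset copied from the output above.

```python
# A subset of the search output quoted above (copied verbatim).
SEARCH_OUTPUT = """\
====================== Name Exactly Matched: kernel-devel ======================
kernel-devel-5.14.0-284.25.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.30.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.8.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
========================== Name Matched: kernel-devel ==========================
kernel-devel-matched-5.14.0-284.30.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
"""


def available_kernel_devel(search_output: str) -> set[str]:
    """Collect the kernel releases that have a kernel-devel package listed.

    Hypothetical helper: assumes each result line looks like
    '<package name> : <summary>', as in the output above.
    """
    prefix = "kernel-devel-"
    releases = set()
    for line in search_output.splitlines():
        name = line.split(" : ", 1)[0].strip()
        # Skip the kernel-devel-matched meta packages and the '====' headers.
        if name.startswith(prefix) and not name.startswith("kernel-devel-matched-"):
            releases.add(name[len(prefix):])
    return releases


def has_kernel_devel(search_output: str, kernel_release: str) -> bool:
    """True if a kernel-devel package for kernel_release appears in the output."""
    return kernel_release in available_kernel_devel(search_output)


if __name__ == "__main__":
    running = "5.14.0-284.40.1.el9_2.x86_64"  # kernel reported in the MOFED log
    print(running, "available:", has_kernel_devel(SEARCH_OUTPUT, running))
```

Running the same check against the full pod output gives the same answer for 284.40.1, since that release does not appear anywhere in the listing.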

ruta-04 commented 6 months ago

Any update on this?

rollandf commented 6 months ago

It looks like the entitlement is set up correctly. Did you restart the MOFED pods after setting up the entitlement?

BTW, in version 24.1, entitlement is not needed because compilation is done with DTK.

ruta-04 commented 6 months ago

I set up the entitlement before starting the MOFED pods.

Anything else we can try?


rollandf commented 6 months ago

@e0ne I remember that we encountered an issue where the Red Hat subscription had to be of a certain type. Do you recall anything related?

ruta-04 commented 5 months ago

Any update on this one?


rollandf commented 5 months ago

Would you be able to use a more recent Network Operator release? Starting with 24.1, we no longer require cluster-wide entitlement on OpenShift. Latest: https://docs.nvidia.com/networking/display/kubernetes2411