Mellanox / network-operator

Mellanox Network Operator
Apache License 2.0

MOFED pods: driver install fails on RHEL8.8 #888

Closed gseidlerhpe closed 1 week ago

gseidlerhpe commented 3 months ago

What happened: Deployed the network operator on RHEL 8.8 hosts with the option `ofedDriver.deploy: true`.

The ofed driver pods fail due to this error:

Error: Unable to find a match: kernel-4.18.0-477.10.1.el8_8.x86_64

Command "dnf -q -y --releasever=8.8 install kernel-4.18.0-477.10.1.el8_8.x86_64" failed with exit code: 1

What you expected to happen: ofed driver install succeeds on RHEL 8.8. Release notes for network operator v23.10.0 state that RHEL 8.8 is supported: https://docs.nvidia.com/networking/display/kubernetes2310/release+notes

How to reproduce it (as minimally and precisely as possible): Deploy the network operator on RHEL 8.8 hosts with a valid RHEL subscription.

Anything else we need to know?: Tried the option to specify a private repo (`ofedDriver.repoConfig.name`); the network-operator pod log shows this error:

2024-04-10T16:39:52Z ERROR Error while syncing state {"controller": "nicclusterpolicy", "controllerGroup": "mellanox.com", "controllerKind": "NicClusterPolicy", "NicClusterPolicy": {"name":"nic-cluster-policy"}, "namespace": "", "name": "nic-cluster-policy", "reconcileID": "d09bbc74-ce62-4fe4-9ccc-99838b245ed3", "error": "failed to create k8s objects from manifest: failed to get destination directory for custom repo config: distribution not supported", "errorVerbose": "failed to get destination directory for custom repo config: distribution not supported\nfailed to create k8s objects from manifest\ngithub.com/Mellanox/network-operator/pkg/state.(stateOFED).Sync\n\t/workspace/pkg/state/state_ofed.go:270\ngithub.com/Mellanox/network-operator/pkg/state.(stateManager).SyncState\n\t/workspace/pkg/state/manager.go:92\ngithub.com/Mellanox/network-operator/controllers.(NicClusterPolicyReconciler).Reconcile\n\t/workspace/controllers/nicclusterpolicy_controller.go:144\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
github.com/Mellanox/network-operator/pkg/state.(stateManager).SyncState
    /workspace/pkg/state/manager.go:101
github.com/Mellanox/network-operator/controllers.(NicClusterPolicyReconciler).Reconcile
    /workspace/controllers/nicclusterpolicy_controller.go:144
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235

Got the ofed driver to install successfully by patching the mofed-rhel8.8-ds daemonset and adding these volumeMounts/volumes entries:

volumeMounts:
- mountPath: /run/secrets/etc-pki-entitlement
  name: subscription-config-0
  readOnly: true
- mountPath: /run/secrets/redhat.repo
  name: subscription-config-1
  readOnly: true
- mountPath: /run/secrets/rhsm
  name: subscription-config-2
  readOnly: true

volumes:
- hostPath:
    path: /etc/pki/entitlement
    type: Directory
  name: subscription-config-0
- hostPath:
    path: /etc/yum.repos.d/redhat.repo
    type: File
  name: subscription-config-1
- hostPath:
    path: /etc/rhsm
    type: Directory
  name: subscription-config-2
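For anyone who wants to apply the same workaround without editing the daemonset by hand, here is a minimal sketch that builds an equivalent JSON Patch and prints a `kubectl patch` command. It assumes the driver container is at index 0 of the pod spec, that the daemonset already has `volumes` and `volumeMounts` lists to append to, and that the operator runs in the `nvidia-network-operator` namespace. The helper names `build_patch` and `kubectl_command` are made up for illustration.

```python
import json
import shlex

# (host path, hostPath type, mount path) taken from the workaround above.
SUBSCRIPTION_MOUNTS = [
    ("/etc/pki/entitlement", "Directory", "/run/secrets/etc-pki-entitlement"),
    ("/etc/yum.repos.d/redhat.repo", "File", "/run/secrets/redhat.repo"),
    ("/etc/rhsm", "Directory", "/run/secrets/rhsm"),
]


def build_patch(container_index=0):
    """Return JSON Patch operations that append the volumes and their mounts.

    container_index=0 is an assumption; check the daemonset spec first.
    """
    ops = []
    for i, (host_path, path_type, mount_path) in enumerate(SUBSCRIPTION_MOUNTS):
        name = f"subscription-config-{i}"
        # "/-" appends to an existing array (JSON Patch, RFC 6902).
        ops.append({
            "op": "add",
            "path": "/spec/template/spec/volumes/-",
            "value": {"name": name,
                      "hostPath": {"path": host_path, "type": path_type}},
        })
        ops.append({
            "op": "add",
            "path": f"/spec/template/spec/containers/{container_index}/volumeMounts/-",
            "value": {"name": name, "mountPath": mount_path, "readOnly": True},
        })
    return ops


def kubectl_command(namespace="nvidia-network-operator",
                    daemonset="mofed-rhel8.8-ds"):
    """Return a shell-quoted kubectl command that applies the patch."""
    patch = json.dumps(build_patch())
    return " ".join([
        "kubectl", "-n", namespace, "patch", "daemonset", daemonset,
        "--type=json", "-p", shlex.quote(patch),
    ])


if __name__ == "__main__":
    print(kubectl_command())
```

Review the printed command before running it; this only reproduces the manual patch described above and is not an official fix.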

Logs:

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                                                                       AGE
daemonset.apps/cni-plugins-ds             4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/kube-multus-ds             4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/mofed-rhel8.8-ds           3         3         3       3            3           feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=rhel,feature.node.kubernetes.io/system-os_release.VERSION_ID=8.8   45m
daemonset.apps/nic-feature-discovery-ds   4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/nv-ipam-node               4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/rdma-shared-dp-ds          3         3         3       3            3           feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false                                                                       45m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator     1/1     1            1           4d19h
deployment.apps/nv-ipam-controller   2/2     2            2           45m

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-5cbb6ccd74     0         0         0       4d19h
replicaset.apps/network-operator-6444bc476f     1         1         1       4d15h
replicaset.apps/network-operator-76b9994f84     0         0         0       4d19h
replicaset.apps/nv-ipam-controller-64c89dcfd5   2         2         2       45m

- Network Operator version: v23.10.0
- Logs of Network Operator controller:
[mofed-rhel8.8-ds-h5bcr-success.log](https://github.com/Mellanox/network-operator/files/14935578/mofed-rhel8.8-ds-h5bcr-success.log)
[mofed-rhel8.8-ds-rqbfp-crash.log](https://github.com/Mellanox/network-operator/files/14935579/mofed-rhel8.8-ds-rqbfp-crash.log)
[network-operator-6444bc476f-g22tf.log](https://github.com/Mellanox/network-operator/files/14935580/network-operator-6444bc476f-g22tf.log)

- Logs of the various Pods in `nvidia-network-operator` namespace:
- Helm Configuration (if applicable):
custom-values.yaml

nfd:
  enabled: false
  deployNodeFeatureRules: true

operator:
  tolerations: []
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:

sriovNetworkOperator:
  enabled: false

# NicClusterPolicy CR values:

deployCR: true

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: true
  resources:

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: false

nvIpam:
  deploy: true

sriovDevicePlugin:
  deploy: false

ofedDriver:
  deploy: true
  repoConfig:
    name: repo-config
  env:

nicFeatureDiscovery:
  deploy: true



- Kubernetes' nodes information (labels, annotations and status): `kubectl get node -o yaml`:

**Environment**:
- Kubernetes version (use `kubectl version`):  v1.27.10
- Hardware configuration:
  - Network adapter model and firmware version:

> 26:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
> 26:00.1 DMA controller [0801]: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface [15b3:c2d5] (rev 01)
> 9f:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
> 9f:00.1 DMA controller [0801]: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface [15b3:c2d5] (rev 01)
> b4:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
> b4:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
> b4:00.2 DMA controller [0801]: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface [15b3:c2d5] (rev 01)

- OS (e.g: `cat /etc/os-release`):

> NAME="Red Hat Enterprise Linux"
> VERSION="8.8 (Ootpa)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="8.8"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
> HOME_URL="https://www.redhat.com/"
> DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> 
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
> REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

- Kernel (e.g. `uname -a`):
4.18.0-477.10.1.el8_8.x86_64
- Others:

rollandf commented 3 months ago

Thanks for the report.

Which CRI are you using? For RHEL8/RHEL9, only CRI-O is supported.

noama-nv commented 3 months ago

On RHEL you have to use CRI-O with containers-common installed for the entitlement to be mounted.
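A quick way to check which runtime each node reports is the `status.nodeInfo.containerRuntimeVersion` field from `kubectl get nodes -o json`. The sketch below flags nodes that are not on CRI-O. The sample data and node names are hypothetical, and the function name `non_crio_nodes` is made up for illustration.

```python
def non_crio_nodes(node_list):
    """Return {node name: runtime} for nodes whose runtime is not CRI-O.

    node_list is the parsed output of `kubectl get nodes -o json`.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        runtime = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        # CRI-O nodes report a value of the form "cri-o://<version>".
        if not runtime.startswith("cri-o://"):
            result[name] = runtime
    return result


# Hypothetical sample with only the fields the check needs.
SAMPLE = {
    "items": [
        {"metadata": {"name": "worker-1"},
         "status": {"nodeInfo": {"containerRuntimeVersion": "cri-o://1.27.1"}}},
        {"metadata": {"name": "worker-2"},
         "status": {"nodeInfo": {"containerRuntimeVersion": "containerd://1.7.2"}}},
    ]
}

if __name__ == "__main__":
    for name, runtime in non_crio_nodes(SAMPLE).items():
        print(f"{name}: {runtime} (not CRI-O)")
```

This only checks the runtime; whether containers-common is installed still has to be verified on each host.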

rollandf commented 1 week ago

Fixed with #946 and #929.