NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0
1.83k stars 297 forks

Cannot establish GPU-operator with GDRDMA #588

Open ReyRen opened 1 year ago

ReyRen commented 1 year ago

1. Quick Debug Information

2. Issue or feature description

While attempting to use the GDRDMA feature, I followed the deployment instructions in the GPU-operator documentation. I have already installed the OFED driver on my physical machine (non-containerized), so I set the parameters "--set driver.rdma.enabled=true --set driver.rdma.useHostMofed=true". However, the Driver-daemon pod reports an error:

[image]

Here is the pod status:

[image]

4. Information to attach (optional if deemed irrelevant)

[image] [image]

The full debug bundle has already been sent to operator_feedback@nvidia.com.

shivamerla commented 1 year ago

@ReyRen from the debug bundle provided, it looks like the driver pod logs are truncated. Can you get the logs from the "nvidia-driver-ctr" container within the driver pod? It looks like the NVIDIA driver install is not going through. Attaching the dmesg logs will also help.
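
For readers following along, the request above could be carried out roughly as sketched below. This is only an illustration: the helper name `driver_ctr_logs` is hypothetical, and the namespace and pod name are placeholders that depend on how the operator was installed. Only the container name "nvidia-driver-ctr" comes from the comment above.

```shell
# Hypothetical helper: fetch logs from the nvidia-driver-ctr container
# of a driver pod.
#   $1: namespace the operator runs in (assumption: often gpu-operator)
#   $2: name of the driver pod (list pods with: kubectl get pods -n <namespace>)
driver_ctr_logs() {
    kubectl logs -n "$1" "$2" -c nvidia-driver-ctr
}
```

A possible usage is `driver_ctr_logs gpu-operator <driver-pod-name> > nvidia-driver-ctr.log`, followed by `dmesg > dmesg.log` on the affected GPU node (this may need root), so that both files can be attached to the issue.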

ruta-04 commented 8 months ago

I am also facing a similar issue. In my case, I want to enable RDMA and disable useHostMofed for a Network Operator installation on OpenShift:

[https://docs.nvidia.com/networking/display/cokan10/network+operator#src-39285883_NetworkOperator-DOCP]

Apart from the GPU-operator and monitoring pods, all others are stuck in Init state.