arpitsharma-vw opened 9 months ago
@shivamerla Hi, Can you help here? Many thanks :)
@arpitsharma-vw can you check dmesg on the node and report any driver errors (dmesg | grep -i nvrm). If you see GSP RM related errors, please try this workaround to disable GSP RM.
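For reference, one way to check whether GSP firmware is currently in use on the node (these commands are a suggestion, not part of the original reply, and assume a reasonably recent driver) is:
nvidia-smi -q | grep -i gsp
cat /proc/driver/nvidia/params | grep EnableGpuFirmware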
@shivamerla What if the driver is already installed (as is the case with the EKS GPU AMI)? Will the driver component still try to apply the kernel module config?
Many thanks @shivamerla for your input. I can confirm that we see GSP RM related errors here. But regarding the fix, we have installed the GPU operator via OLM (not Helm). I am afraid that these changes will get wiped out again on the next upgrade.
Let me explain how this can be done on OpenShift:
First, create a ConfigMap as described in the doc for disabling GSP RM:
oc create configmap kernel-module-params -n nvidia-gpu-operator --from-file=nvidia.conf=./nvidia.conf
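For completeness, the nvidia.conf file referenced above would contain the module option that disables GSP firmware. Its exact contents are not shown in this thread; the line below follows the documented workaround:
options nvidia NVreg_EnableGpuFirmware=0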
Then add the following to the ClusterPolicy:
driver:
  <...>
  kernelModuleConfig:
    name: kernel-module-params
  <...>
You can do it either via the Web console, or using this command:
oc patch clusterpolicy/gpu-cluster-policy -n nvidia-gpu-operator --type='json' -p='[{"op": "add", "path": "/spec/driver/kernelModuleConfig/name", "value":"kernel-module-params"}]'
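As a quick sanity check after patching (this command is a suggestion, not part of the original comment), you can confirm that the section landed in the resource:
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver.kernelModuleConfig}'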
Essentially, the outcome should be the same, no matter if done via Helm or using the method I described. That is, the ClusterPolicy resource will have the right section added to it. The oc patch command above assumes that there is already a ClusterPolicy resource, but you can also add the required kernelModuleConfig section right away when creating the ClusterPolicy (via the Web console or from a file).
I believe that the changes will persist as they will be part of the ClusterPolicy. Also, the operator will probably restart the driver to pick up the changes. Please correct me if I'm wrong @shivamerla.
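One way to verify that the parameter took effect once the driver pod has restarted (again, a suggested check rather than something from the thread) is to read the loaded module parameters from inside the driver container:
oc exec DRIVER_POD_NAME -n nvidia-gpu-operator -c nvidia-driver-ctr -- grep EnableGpuFirmware /proc/driver/nvidia/params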
Same here. I'm using EKS 1.29 with the latest AMI that includes "a fix" (https://github.com/awslabs/amazon-eks-ami/issues/1494#issuecomment-1969724714), and GPU operator v23.9.1.
Even the DCGM exporter failed to start with a message:
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown
After applying the suggested fix to disable GSP:
Warning UnexpectedAdmissionError 11s kubelet Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
1. Quick Debug Information
2. Issue or feature description
We have an OpenShift cluster where we have installed the NVIDIA GPU operator. When we run any pod on a g5.48xlarge machine, we get an error.
The same pod works well on other machines such as g5.4xlarge and g5.12xlarge. We started seeing this behaviour only recently; earlier, the same pod worked on a g5.48xlarge instance.
We also see that the pod from nvidia-dcgm-exporter is failing with the following error:
3. Steps to reproduce the issue
Assign a pod to a g5.48xlarge node: the assignment works, but the pod does not run.
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
Logs from nvidia-dcgm-exporter pod
Logs from GPU feature discovery pod:
GPU cluster policy
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com