NVIDIA / k8s-driver-manager

The NVIDIA Driver Manager is a Kubernetes component that assists in seamless upgrades of the NVIDIA driver on each node of the cluster.
Apache License 2.0

RedHat9.2 exec k8s-driver-manager error #37

Closed lengrongfu closed 3 weeks ago

lengrongfu commented 2 months ago

I am using gpu-operator:v23.9.0 to install the NVIDIA GPU driver, but after the nvidia-driver-daemonset pod starts, the machine's kernel crashes.

The GPU card is a Tesla P4.

OS info: Red Hat 9.2, kernel version 5.14.0-284.11.1.el9_2.x86_64.

The machine has the nouveau driver installed. When I used the dmesg command to look at the kernel log, I found many errors about nouveau:

[image: dmesg output]

lengrongfu commented 2 months ago

@cdesiniotis Have you seen this error?

cdesiniotis commented 2 months ago

@lengrongfu I am not familiar with this error. It is recommended to blacklist nouveau, as it can conflict with the NVIDIA driver.

lengrongfu commented 2 months ago

I am using gpu-operator to install the driver. Do I need to manually add nouveau to the blacklist before installing gpu-operator?

The k8s-driver-manager pod gets an error when it executes rmmod nouveau: https://github.com/NVIDIA/k8s-driver-manager/blob/659892aea6af4442e6e63b8a97cadc838c84782c/driver-manager#L494
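
For anyone debugging the same symptom, it can help to check whether nouveau is loaded and whether anything still holds a reference to it, since a module with a non-zero use count generally cannot be removed with rmmod. The following is only a minimal sketch: the helper name `module_use_count` is hypothetical and is not part of k8s-driver-manager, and it assumes the usual `/proc/modules` layout (name, size, use count, dependents, state, address).

```python
from pathlib import Path
from typing import Optional


def module_use_count(proc_modules_text: str, name: str = "nouveau") -> Optional[int]:
    """Return the use count of a kernel module, or None if it is not loaded.

    Hypothetical helper for illustration. Returns -1 when the kernel does not
    report a use count (the field is "-" on kernels without module unloading).
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Assumed layout: name, size, use count, dependents, state, address.
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2]) if fields[2].isdigit() else -1
    return None


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        count = module_use_count(proc_modules.read_text())
        if count is None:
            print("nouveau is not loaded")
        else:
            print(f"nouveau is loaded, use count: {count}")
```

This does not confirm the cause of the crash reported here; it only shows whether the module is present and in use on a given node.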

cdesiniotis commented 2 months ago

I am using gpu-operator to install the driver. Do I need to manually add nouveau to the blacklist before installing gpu-operator?

This is not a required prerequisite, but because you are seeing errors from nouveau, I recommend that you try blacklisting it. As you pointed out, we do take care of unloading the module.

lengrongfu commented 2 months ago

OK, thanks. After I blacklisted nouveau, k8s-driver-manager runs successfully.

lengrongfu commented 2 months ago

@cdesiniotis Could we discuss whether it is possible to add an option to k8s-driver-manager that blacklists nouveau?

cdesiniotis commented 3 weeks ago

Since blacklisting would require updating the initramfs and rebooting the node, it is not something we would be open to adding to this component. This should be done during infrastructure provisioning.
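
As a rough illustration of what that provisioning step might look like on a RHEL-like node, here is a minimal sketch. It is not taken from this repository: the file name `blacklist-nouveau.conf` and the helper names are assumptions, and the sketch only builds the configuration text and the follow-up commands rather than running anything.

```python
from typing import List

# Assumed location; any file name under /etc/modprobe.d/ ending in .conf is read by modprobe.
BLACKLIST_PATH = "/etc/modprobe.d/blacklist-nouveau.conf"


def render_blacklist(module: str = "nouveau") -> str:
    """Build the modprobe.d content that stops the module from auto-loading."""
    return f"blacklist {module}\noptions {module} modeset=0\n"


def provisioning_commands() -> List[List[str]]:
    """Commands to run as root after writing the file (listed here, not executed)."""
    return [
        ["dracut", "--force"],    # rebuild the initramfs so it picks up the blacklist
        ["systemctl", "reboot"],  # reboot so the node comes up without the module loaded
    ]


if __name__ == "__main__":
    print(f"# write to {BLACKLIST_PATH}:")
    print(render_blacklist(), end="")
    for cmd in provisioning_commands():
        print("# then run:", " ".join(cmd))
```

The initramfs rebuild and the reboot are the two steps mentioned above as the reason this belongs in node provisioning rather than in a pod that runs on an already-booted node.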