Closed mbio16 closed 1 year ago
@mbio16 Can you attach the output of "journalctl -xb" from this node? It might show the actual error when we attempt to load the nvidia modules. Also, can you share more details on the system: which hypervisor and version, and is it EFI boot?
I will.
system info:
The VM has to be configured with EFI boot, and the following PCI params are required in the VM config. The above error during driver load is seen without these settings.
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128
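For anyone who wants to double-check these two settings, here is a minimal Python sketch that parses the text of a VM config file and reports what looks wrong. It assumes the usual `key = "value"` layout of a .vmx file; the function names `parse_vmx` and `check_passthrough` are made up for illustration and are not part of any VMware tool. The right MMIO size depends on the GPU, so the sketch only checks that the value is a positive integer.

```python
import re

# Assumed layout: one `key = "value"` pair per line, quotes optional.
LINE = re.compile(r'^\s*([\w.]+)\s*=\s*"?([^"\n]*?)"?\s*$')


def parse_vmx(text):
    """Return a dict of the settings found in the config text."""
    settings = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_passthrough(settings):
    """Return a list of problems with the two PCI params from this thread."""
    problems = []
    if settings.get("pciPassthru.use64bitMMIO", "").upper() != "TRUE":
        problems.append("pciPassthru.use64bitMMIO is not TRUE")
    size = settings.get("pciPassthru.64bitMMIOSizeGB", "")
    if not size.isdigit() or int(size) <= 0:
        problems.append(
            "pciPassthru.64bitMMIOSizeGB is missing or not a positive integer"
        )
    return problems
```

An empty list from `check_passthrough(parse_vmx(text))` means both params are present. A value pasted with typographic quotes (”TRUE”) is reported as a problem, because only plain ASCII quotes are stripped.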
@shivamerla The PCI params have been set for these VMs, which boot with BIOS. I tried creating a new VM node with EFI boot, but the VM starts and shuts down immediately. I tried both with and without the secure boot option; both give the same result. When I set BIOS, CoreOS boots normally. The boot ISO is the one generated by the OpenShift page "Install OpenShift with the Assisted Installer".
Hi,
the solution is to run EFI without secure boot. BIOS mode caused a kernel compile error. EFI boot with secure boot caused an error related to the driver signature. EFI boot without the secure boot option gives a valid worker install that works with the NVIDIA GPU operator.
Hope this comment saves other admins from struggling.
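To confirm which of the three boot modes a worker node actually came up in, something like the following Python sketch could be run on the node. It relies on two Linux conventions: the directory /sys/firmware/efi exists only on EFI boots, and the SecureBoot EFI variable exposed through efivarfs holds four attribute bytes followed by one value byte (1 means enabled). The function name `boot_mode` and the `sysfs_root` parameter are hypothetical and exist only to make the sketch testable; whether Python is available on the node is also an assumption.

```python
from pathlib import Path

# Name of the SecureBoot variable under the EFI global variable GUID.
SECURE_BOOT_VAR = "SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c"


def boot_mode(sysfs_root="/sys"):
    """Classify the boot mode as BIOS, EFI with secure boot, or EFI without."""
    efi_dir = Path(sysfs_root) / "firmware" / "efi"
    if not efi_dir.is_dir():
        # No EFI directory in sysfs: the kernel was booted in legacy BIOS mode.
        return "bios"
    try:
        data = (efi_dir / "efivars" / SECURE_BOOT_VAR).read_bytes()
    except OSError:
        # Variable missing or unreadable (for example, efivarfs not mounted).
        return "efi (secure boot state unknown)"
    # Bytes 0-3 are the variable attributes; byte 4 is the value.
    if len(data) >= 5 and data[4] == 1:
        return "efi with secure boot"
    return "efi without secure boot"
```

Calling `print(boot_mode())` on a node set up as described above should report "efi without secure boot".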
1. Quick Debug Checklist
1. Issue or feature description
The nvidia-driver-daemonset pod contains the container openshift-driver-toolkit-ctr, which tries to compile the NVIDIA GPU driver version 515.65.01. The driver install fails. Logs from the container:
2. Steps to reproduce the issue