Open nadav213000 opened 1 year ago
@cdesiniotis ^^
@nadav213000 can you collect the nvidia-vgpu-manager logs from the host:
grep vmiop_log: /var/log/messages
If these logs are not visible on the host, then try collecting this from within the openshift-driver-toolkit-ctr container in the nvidia-vgpu-manager-daemonset pod.
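Roughly something like the following should work; the namespace, node, and pod names are placeholders for whatever your deployment uses (untested sketch):

```bash
# On the host, via a node debug pod:
oc debug node/<node-name> -- chroot /host grep vmiop_log: /var/log/messages

# Or from within the openshift-driver-toolkit-ctr container of the vGPU Manager daemonset pod:
oc exec -n nvidia-gpu-operator <nvidia-vgpu-manager-daemonset-pod> \
  -c openshift-driver-toolkit-ctr -- grep vmiop_log: /var/log/messages
```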
@cdesiniotis There is no /var/log/messages path on the Nodes. I've searched the journal for such logs, but couldn't find anything. Are there other logs I should look for?
The Node OS version is RHCOS 4.10
The below error suggests that the nvidia-vgpu-vfio module cannot communicate with the nvidia-vgpu-manager
[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: start failed. status: 0x0 Timeout Occured
We need to get logs from nvidia-vgpu-manager to debug this further. Here is the official vGPU documentation for how to gather logs: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#examine-vgpu-manager-messages
Can you try running a virtual machine again and running the below shortly after?
journalctl | grep vmiop
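On an RHCOS node you would typically do this through a node debug session; a rough sketch (node name is a placeholder):

```bash
# Start the VM, then shortly afterwards check the journal on the node hosting the vGPU:
oc debug node/<node-name> -- chroot /host sh -c 'journalctl --since "10 minutes ago" | grep vmiop'
```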
I have tried to run this command but couldn't find any logs containing vmiop. I also searched the other logs in the /var/log directory but still couldn't find any vmiop entries.
All the pods are in a Running state. How can I validate that the nvidia-vgpu-manager is actually running and functioning?
I see. Can you try adding the following volume mount to the nvidia-vgpu-manager daemonset and check if you can see these logs? https://github.com/NVIDIA/gpu-operator/commit/4edef0f84b8418e553f1aaeb60bbe5f4e47120c3
To check if nvidia-vgpu-manager is running, you can run ps aux | grep nvidia-vgpu-manager on the host.
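For example, something like this from a node debug session (node name is a placeholder; the daemon usually shows up as nvidia-vgpu-mgr):

```bash
# Look for the vGPU Manager daemon on the host; a healthy node shows an nvidia-vgpu-mgr process.
oc debug node/<node-name> -- chroot /host sh -c 'ps aux | grep -i nvidia-vgpu | grep -v grep'
```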
I added the volume mount and I ran the following command:
journalctl -D /var/log/journal/ | grep -i vmiop
and there are three logs that repeat constantly:
Jul 17 02:46:22 <node> nvidia-vgpu-mgr[310831]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 17 02:46:22 <node> nvidia-vgpu-mgr[310831]: error: vmiop_env_log: (0x0): Failed to create vGPU character device with minor number 1 error 0x26
Jul 17 02:46:22 <node> nvidia-vgpu-mgr[310831]: error: vmiop_env_log: error: failed to notify VM start operation information: 59
Jul 17 02:47:31 <node> nvidia-vgpu-mgr[312671]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 17 02:47:31 <node> nvidia-vgpu-mgr[312671]: error: vmiop_env_log: (0x0): Failed to create vGPU character device with minor number 2 error 0x26
Jul 17 02:47:31 <node> nvidia-vgpu-mgr[312671]: error: vmiop_env_log: error: failed to notify VM start operation information: 59
It's the same node in all the logs.
There is also an error log at the beginning:
Jul 16 10:36:59 <node> nvidia-vgpu-mgr[3175360]: error: vmiop_env_log: Failed to attach device: 0x26 (gpuId 0x4100)
Jul 16 10:37:05 <node> nvidia-vgpu-mgr[3175737]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
Thanks @nadav213000. Can you try rebuilding the vgpu-manager container image from the latest changes in the driver container repository: https://gitlab.com/nvidia/container-images/driver
There was a bug fix that was recently merged into the vgpu-manager container scripts that may resolve this issue: https://gitlab.com/nvidia/container-images/driver/-/commit/94324dc6dbaff191b72a734b7734710110d82198
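The rebuild is along these lines; the directory name, build arguments, and registry below are assumptions, so check the repository's README for the exact steps for your OS and driver version:

```bash
# Clone the driver container repository and rebuild the vGPU Manager image from the latest changes.
git clone https://gitlab.com/nvidia/container-images/driver.git
cd driver/vgpu-manager/rhel8        # directory name is an assumption; use the one matching your node OS

# Copy the vGPU Manager .run file (downloaded from the NVIDIA licensing portal) into the build context.
cp /path/to/NVIDIA-Linux-x86_64-470.182.02-vgpu-kvm.run .

# Build and push to a private registry the cluster can pull from.
podman build --build-arg DRIVER_VERSION=470.182.02 \
  -t <private-registry>/vgpu-manager:470.182.02 .
podman push <private-registry>/vgpu-manager:470.182.02
```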
Thanks that solved the issue!
After that, I tried to deploy a Windows Server 2016 VM and assign the vGPU to it, but the console showed the error guest has not initialized the display (yet), as described in this issue: https://github.com/kubevirt/kubevirt/issues/7245.
I changed the following in the VM spec:
devices:
  gpus:
  - deviceName: nvidia.com/NVIDIA_A10-12Q
    name: nvidia-10a
    virtualGPUOptions:
      display:
        enabled: true
        ramFB:
          enabled: false
After setting ramFB to false, I could access the UI console in OpenShift.
After I installed the NVIDIA vGPU driver and configured its license, I couldn't access the UI console anymore; it only showed a black screen.
Do you know why I had to set ramFB to false for the VM console to be available? Should I increase the VM RAM?
And do you know why the console is no longer available after installing the vGPU driver, or how I can investigate this problem?
@nadav213000 Can you try adding the below vGPU plugin option on the host?
echo "disable_vnc=0" > /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params
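Roughly like this from the host (e.g. via `oc debug node/<node-name>` followed by `chroot /host`); the UUID is whatever appears under /sys/bus/mdev/devices, and (as it turned out below) the VM using the vGPU has to be stopped first:

```bash
ls /sys/bus/mdev/devices/                                   # list the active vGPU mdev UUIDs
echo "disable_vnc=0" > /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params
cat /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params    # confirm the setting was applied
```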
I have tried to run this command from the vgpu-manager-daemonset pod, but I get the following error:
sh: echo: write error: Operation not permitted
I also tried to run that command from the Node itself but got the same error.
How should I run this command?
I had to stop the VM that was using the vGPU before changing the vgpu_params values.
But even after changing the value, I still see a black screen in the OpenShift console.
Can I provide you with some logs to help identify the issue?
@nadav213000 are you using the desktop viewer option (RDP) in OSV? This is the recommended method for accessing Windows VMs, as documented here: https://docs.openshift.com/container-platform/4.12/virt/virtual_machines/virt-accessing-vm-consoles.html. After installing the vGPU guest driver, the VNC console may not show anything. I have been told RDP should work.
1. Issue or feature description
We are trying to configure an OpenShift environment to use NVIDIA vGPU with the NVIDIA GPU Operator. We followed the steps described in this guide in the NVIDIA vGPU documentation.
I have a trial license from NVIDIA to use the vGPU software from their portal. I downloaded the Linux KVM all supported package (should I install the RHEL drivers instead?), version 13.7. As described in the tutorial, I extracted the NVIDIA-Linux-x86_64-470.182.02-vgpu-kvm.run file and built the image using the driver repository.
I have also configured the ClusterPolicy, and all the pods related to the GPU Operator are in a Running state. I configured CNV (KubeVirt) on the cluster and edited the HyperConverged resource to allow mediated devices so the GPU can be used by VMs (all the steps as described in the NVIDIA GPU Operator guide); a rough sketch of that change is shown just below.
I tried to deploy a new VM with RHEL 7.9 (it looks like this version is supported by driver version 13.7 in the documentation).
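For reference, the HyperConverged change was along these lines. This is only a rough sketch from memory: the exact field names (e.g. mediatedDevicesTypes) and the nvidia-<type-id> value vary between OpenShift Virtualization versions, so the NVIDIA GPU Operator docs for your release are authoritative.

```bash
# Hedged sketch: permit the A10-12Q vGPU type and expose it as nvidia.com/NVIDIA_A10-12Q.
# Field names and the nvidia-<type-id> value are assumptions; verify against your CNV version.
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge --patch '
spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-<type-id>                 # the mdev type corresponding to NVIDIA A10-12Q
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: NVIDIA A10-12Q
      resourceName: nvidia.com/NVIDIA_A10-12Q
'
```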
When I try to create the VM configured as follows:
It fails to start and there are the following warnings in the VM events:
In the VM virt-launcher pod I get the following errors:
and finally the virt-launcher fails, and so does the VM.
I can't see any related logs in the Operator pods; only in the sandbox-device-plugin is there a log:
The sandbox also showed this error when it started, but it is in a Running state:
On the host itself, using dmesg, I can see the following error:
Running nvidia-smi vgpu from the nvidia-vgpu-manager-daemonset pod within openshift-driver-toolkit-ctr:
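Roughly, I ran that command like this (namespace and pod name are placeholders for my deployment):

```bash
# Run nvidia-smi vgpu inside the driver toolkit container of the vGPU Manager daemonset pod.
oc exec -n nvidia-gpu-operator <nvidia-vgpu-manager-daemonset-pod> \
  -c openshift-driver-toolkit-ctr -- nvidia-smi vgpu
```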
2. Environment details
22.9.1
4.10.1
3. Steps to reproduce the issue
OpenShift 4.10 on bare metal nodes.
Do you have a suggestion why this behavior happens? Let me know if I can provide any additional information.