NVIDIA / gpu-operator


Problem configuring vGPU access using Kubevirt #527

Open nadav213000 opened 1 year ago

nadav213000 commented 1 year ago

1. Issue or feature description

We are trying to configure an OpenShift environment to use NVIDIA vGPU with the NVIDIA GPU Operator. We followed the steps described in this guide in the NVIDIA vGPU documentation.

I have a trial license from NVIDIA to use the vGPU software from their portal. I downloaded the "Linux KVM All Supported" package, version 13.7 (should I install the RHEL drivers instead?). As described in the tutorial, I extracted NVIDIA-Linux-x86_64-470.182.02-vgpu-kvm.run and built the image using the driver container repository.

I have also configured the ClusterPolicy, and all the pods related to the GPU Operator are in a Running state. I configured CNV (KubeVirt) on the cluster and edited the HyperConverged CR to permit the mediated device so the GPU can be used by VMs (all the steps as described in the NVIDIA GPU Operator guide).
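For reference, the mediated-device part of my HyperConverged change looks roughly like this; a sketch from memory, assuming the default OpenShift Virtualization install (the selector, resource name, and the externalResourceProvider flag are the values I understand the GPU Operator guide to use for this profile):

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    mediatedDevices:
      # vGPU profile name as reported on the mdev bus (illustrative)
      - mdevNameSelector: GRID A100D-40C
        # resource name the sandbox device plugin exposes to Kubernetes
        resourceName: nvidia.com/GRID_A100D-40C
        # set because the GPU Operator's device plugin, not KubeVirt, owns this resource
        externalResourceProvider: true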

I tried to deploy a new VM with RHEL 7.9 (it looks like this version is supported by driver version 13.7 according to the documentation).

When I try to create the VM, configured as follows:

spec:
  domain:
    cpu:
      cores: 4
      sockets: 1
      threads: 1
    devices:
      disks:
        - disk:
            bus: virtio
          name: cloudinitdisk
        - bootOrder: 1
          disk:
            bus: virtio
          name: rootdisk
      gpus:
        - deviceName: nvidia.com/GRID_A100D-40C
          name: a100

It fails to start, and these warnings appear in the VM events:

Generated from virt-handler
4 times in the last 1 minute
unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Generated from virt-handler
7 times in the last 0 minutes
failed to detect VMI pod: dial unix //pods/efab0ff2-c256-49e3-9068-61bfec42dc49/volumes/kubernetes.io~empty-dir/sockets/launcher-sock: connect: connection refused

In the VM virt-launcher pod I get the following errors:

{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_GRID_A100D-40C not set for resource nvidia.com/GRID_A100D-40C","pos":"addresspool.go:50",}
{"component":"virt-launcher","level":"error","msg":"Unable to read from monitor: Connection reset by peer","pos":"qemuMonitorIORead:495","subcomponent":"libvirt","thread":"91",}
{"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetectControllers:455","subcomponent":"libvirt","thread":"45",}
{"component":"virt-launcher","level":"info","msg":"Process 08e8d621-0fa7-5488-9dd4-70540b814b5e and pid 86 is gone!","pos":"monitor.go:148",}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:277","}
{"component":"virt-launcher","level":"info","msg":"Timed out waiting for final delete notification. Attempting to kill domain","pos":"virt-launcher.go:297","timestamp":"2023-05-14T09:36:59.481498Z"}

and finally the virt-launcher fails, and so does the VM.

I can't see any related logs in the Operator pods; only the sandbox-device-plugin has this log:

2023/05/14 09:01:21 In allocate
2023/05/14 09:01:21 Allocated devices map[MDEV_PCI_RESOURCE_NVIDIA_COM_GRID_A100D-40C:01b514e9-2afb-4bd8-a82d-755eb045885a]

The sandbox device plugin also showed this error when it started, but it is in a Running state:

2023/05/11 12:25:46 GRID_A100D-40C Device plugin server ready
2023/05/11 12:25:46 healthCheck(GRID_A100D-40C): invoked
2023/05/11 12:25:46 healthCheck(GRID_A100D-40C): Loading NVML
2023/05/11 12:25:46 healthCheck(GRID_A100D-40C): Failed to initialize NVML: could not load NVML library

In the Host itself using dmesg I can see the following error:

[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: ERESTARTSYS received during open, waiting for 25000 milliseconds for operation to complete
[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: start failed. status: 0x0 Timeout Occured

Running nvidia-smi vgpu from the openshift-driver-toolkit-ctr container within the nvidia-vgpu-manager-daemonset pod:

$ nvidia-smi vgpu -q
GPU 00000000:82:00.0
    Active vGPUs                      : 0

$ nvidia-smi vgpu
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.02             Driver Version: 470.182.02                |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA A10                 | 00000000:82:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

2. Environment details

  1. OpenShift version 4.10
  2. Bare metal node with A100 GPU
  3. GPU operator version: 22.9.1
  4. OpenShift Virtualization: 4.10.1
  5. vGPU driver version 13.7

3. Steps to reproduce the issue

  1. Deploy OpenShift cluster in version 4.10 on bare metal nodes
  2. Configure GPU Operator + vGPU configuration
  3. Configure CNV operator
  4. Create a VM with a hardware resource of the new GPU (a quick check that the resource is advertised on the node is sketched below).
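As a sanity check before creating the VM, this is one way to confirm the vGPU resource is actually advertised by the node (a rough sketch; the node name is a placeholder):

# show the allocatable vGPU resources on the GPU node (names are placeholders)
oc describe node <gpu-node> | grep -A 10 'Allocatable:' | grep nvidia.com
# expect a line like: nvidia.com/GRID_A100D-40C: 1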

Do you have any suggestion as to why this behavior happens? Let me know if I can provide any additional information.

shivamerla commented 1 year ago

@cdesiniotis ^^

cdesiniotis commented 1 year ago

@nadav213000 can you collect the nvidia-vgpu-manager logs from the host:

grep vmiop_log: /var/log/messages

If these logs are not visible on the host, then try collecting this from within openshift-driver-toolkit-ctr in the nvidia-vgpu-manager-daemonset pod.
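For example, roughly like this (the namespace and daemonset name below are assumptions, adjust them to your deployment):

# run the grep inside the openshift-driver-toolkit-ctr container of the vGPU manager daemonset
oc exec -n nvidia-gpu-operator ds/nvidia-vgpu-manager-daemonset \
  -c openshift-driver-toolkit-ctr -- grep vmiop_log: /var/log/messages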

nadav213000 commented 1 year ago

@nadav213000 can you collect the nvidia-vgpu-manager logs from the host:

grep vmiop_log: /var/log/messages

If these logs are not visible on the host, then try collecting this from within openshift-driver-toolkit-ctr in the nvidia-vgpu-manager-daemonset pod.

@cdesiniotis There is no /var/log/messages path on the nodes. I've searched the journal for such logs but couldn't find anything. Are there other logs I should look for?

The Node OS version is RHCOS 4.10

cdesiniotis commented 1 year ago

The below error suggests that the nvidia-vgpu-vfio module cannot communicate with the nvidia-vgpu-manager

[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: start failed. status: 0x0 Timeout Occured

We need to get logs from nvidia-vgpu-manager to debug this further. Here is the official vGPU documentation for how to gather logs: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#examine-vgpu-manager-messages

Can you try running a virtual machine again and running the below shortly after?

journalctl | grep vmiop

nadav213000 commented 1 year ago

The below error suggests that the nvidia-vgpu-vfio module cannot communicate with the nvidia-vgpu-manager

[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: start failed. status: 0x0 Timeout Occured

We need to get logs from nvidia-vgpu-manager to debug this further. Here is the official vGPU documentation for how to gather logs: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#examine-vgpu-manager-messages

Can you try running a virtual machine again and running the below shortly after?

journalctl | grep vmiop

I have tried running this command but couldn't find any logs containing vmiop. I also searched the other logs in the /var/log directory but still couldn't find any. All the pods are in a Running state; how can I validate that the nvidia-vgpu-manager is actually running and functioning?

cdesiniotis commented 1 year ago

I see. Can you try adding the following volume mount to the nvidia-vgpu-manager daemonset and check if you can see these logs? https://github.com/NVIDIA/gpu-operator/commit/4edef0f84b8418e553f1aaeb60bbe5f4e47120c3
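For context, a hostPath mount of the host's /var/log into the vgpu-manager container would look roughly like this; a sketch only, not the exact diff from that commit (container and volume names are illustrative):

# excerpt of the daemonset pod spec (illustrative names)
containers:
  - name: nvidia-vgpu-manager-ctr
    volumeMounts:
      # make the host journal readable inside the container
      - name: host-var-log
        mountPath: /var/log
        readOnly: true
volumes:
  - name: host-var-log
    hostPath:
      path: /var/log

With something like that in place, journalctl -D /var/log/journal should work from inside the container.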

To check if nvidia-vgpu-manager is running, you can run ps aux | grep nvidia-vgpu-manager on the host.

nadav213000 commented 1 year ago

I see. Can you try adding the following volume mount to the nvidia-vgpu-manager daemonset and check if you can see these logs? 4edef0f

I added the volume mount and I ran the following command:

journalctl -D /var/log/journal/ | grep -i vmiop

and there are three logs that repeat constantly:

Jul 17 02:46:22 <node> nvidia-vgpu-mgr[310831]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 17 02:46:22 <node> nvidia-vgpu-mgr[310831]: error: vmiop_env_log: (0x0): Failed to create vGPU character device with minor number 1 error 0x26
Jul 17 02:46:22 <node> nvidia-vgpu-mgr[310831]: error: vmiop_env_log: error: failed to notify VM start operation information: 59

Jul 17 02:47:31 <node> nvidia-vgpu-mgr[312671]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 17 02:47:31 <node> nvidia-vgpu-mgr[312671]: error: vmiop_env_log: (0x0): Failed to create vGPU character device with minor number 2 error 0x26
Jul 17 02:47:31 <node> nvidia-vgpu-mgr[312671]: error: vmiop_env_log: error: failed to notify VM start operation information: 59

It's the same node in all the logs.

nadav213000 commented 1 year ago

There is also an error log at the beginning:

Jul 16 10:36:59 <node> nvidia-vgpu-mgr[3175360]: error: vmiop_env_log: Failed to attach device: 0x26 (gpuId 0x4100)
Jul 16 10:37:05 <node> nvidia-vgpu-mgr[3175737]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started

cdesiniotis commented 1 year ago

Thanks @nadav213000. Can you try rebuilding the vgpu-manager container image from the latest changes in the driver container repository: https://gitlab.com/nvidia/container-images/driver

There was a bug fix that was recently merged into the vgpu-manager container scripts that may resolve this issue: https://gitlab.com/nvidia/container-images/driver/-/commit/94324dc6dbaff191b72a734b7734710110d82198
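The rebuild itself is roughly the following; a sketch only, so take the exact directory layout and build arguments from the repository's README (the path, tag, and registry below are assumptions):

# clone the driver container repository and rebuild the vGPU Manager image (sketch)
git clone https://gitlab.com/nvidia/container-images/driver.git
cd driver/vgpu-manager/rhel8            # path assumed; check the repo layout
cp /path/to/NVIDIA-Linux-x86_64-470.182.02-vgpu-kvm.run .
podman build --build-arg DRIVER_VERSION=470.182.02 \
  -t <registry>/vgpu-manager:470.182.02-rhcos4.10 .
podman push <registry>/vgpu-manager:470.182.02-rhcos4.10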

nadav213000 commented 1 year ago

Thanks @nadav213000. Can you try rebuilding the vgpu-manager container image from the latest changes in the driver container repository: https://gitlab.com/nvidia/container-images/driver

There was a bug fix that was recently merged into the vgpu-manager container scripts that may resolve this issue: https://gitlab.com/nvidia/container-images/driver/-/commit/94324dc6dbaff191b72a734b7734710110d82198

Thanks that solved the issue!

After that, I tried to deploy a Windows Server 2016 VM and assign the vGPU to it, but the console showed an error that the guest has not initialized the display (yet), as described in this issue: https://github.com/kubevirt/kubevirt/issues/7245.

I changed the following in the VM spec:

devices:
  gpus:
    - deviceName: nvidia.com/NVIDIA_A10-12Q
      name: nvidia-10a
      virtualGPUOptions:
        display:
          enabled: true
          ramFB:
            enabled: false

After setting ramFB to false, I could access the UI console in OpenShift.

After I installed the NVIDIA vGPU driver in the guest and configured its license, I couldn't access the UI console anymore; it only showed a black screen.

Do you know why I had to set ramFB to false for the VM console to be available? Should I increase the VM RAM?

And do you know why the console is no longer available after installing the vGPU driver, or how I can investigate this problem?

cdesiniotis commented 12 months ago

@nadav213000 Can you try adding the below vGPU plugin option on the host?

echo "disable_vnc=0" > /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params
nadav213000 commented 12 months ago

@nadav213000 Can you try adding the below vGPU plugin option on the host?

echo "disable_vnc=0" > /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params

I have tried to run this command from the vgpu-manager-daemonset pod, but I get the following error:

sh: echo: write error: Operation not permitted

I also tried to run that command from the Node itself but got the same error.

How should I run this command?

nadav213000 commented 12 months ago

@nadav213000 Can you try adding the below vGPU plugin option on the host?

echo "disable_vnc=0" > /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params

I have tried to run this command from the vgpu-manager-daemonset pod, but I get the following error:

sh: echo: write error: Operation not permitted

I also tried to run that command from the Node itself but got the same error.

How should I run this command?

I had to stop the VM that was using the vGPU before changing the vgpu_params value. But even after changing the value, I still see a black screen in the OpenShift console.
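For the record, the sequence I ended up running was roughly this (command forms are approximate; the VM name, node name, and vGPU UUID are placeholders):

# stop the VM that owns the vGPU so the mdev device is not busy
virtctl stop <vm-name>

# write the plugin parameter from a debug shell on the node
oc debug node/<gpu-node> -- chroot /host \
  sh -c 'echo "disable_vnc=0" > /sys/bus/mdev/devices/<vGPU uuid>/nvidia/vgpu_params'

# start the VM again
virtctl start <vm-name>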

Can I provide you with some logs to help identify the issue?

cdesiniotis commented 12 months ago

@nadav213000 are you using the desktop viewer option (RDP) in OSV? This is the recommended method for accessing Windows VMs, as documented here: https://docs.openshift.com/container-platform/4.12/virt/virtual_machines/virt-accessing-vm-consoles.html. After installing the vGPU guest driver, the VNC console may not show anything. I have been told RDP should work.
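If you want to reach RDP from outside the cluster, a rough sketch of exposing it (the service name and type below are placeholders, not the only option):

# expose the Windows VM's RDP port through a NodePort service (names are placeholders)
virtctl expose vm <windows-vm> --name <windows-vm>-rdp --port 3389 --type NodePort
# then point an RDP client at <node-ip>:<allocated NodePort>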