NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

VirtualGL with NVIDIA GPU Operator in EKS (Invalid EGL device) #714

Open Mohamed-ben-khemis opened 5 months ago

Mohamed-ben-khemis commented 5 months ago

Troubleshooting VirtualGL with NVIDIA GPU Operator in EKS

Issue Summary

VirtualGL fails to detect the GPU in my EKS (Amazon Elastic Kubernetes Service) cluster, which uses the NVIDIA GPU Operator. nvidia-smi confirms that the GPU is present, but running glxgears with GPU acceleration through vglrun produces the following error:

vglrun -d /dev/nvidia0 glxgears
[VGL] ERROR: in init3D--
[VGL] 228: Invalid EGL device

Details

Issue

VirtualGL (vglrun) fails to initialize the 3D environment (glxgears) with an "Invalid EGL device" error when attempting GPU acceleration.

Questions

  1. How can I troubleshoot and resolve the issue of VirtualGL failing to detect and utilize GPUs within my container environment?
  2. Are there additional configurations or dependencies required to enable GPU acceleration with VirtualGL on EKS using the NVIDIA GPU Operator?

Additional Information

@ubuntu-fk5a8-91b4d208t9nxv:/etc/X11/xorg.conf.d$ nvidia-smi 
Fri May  3 11:14:58 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8             14W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

shivamerla commented 4 months ago

@Mohamed-ben-khemis Can you run "kubectl get pods -n gpu-operator" to confirm that the driver is deployed by the operator? The driver container does not install OpenGL libraries today. @elezar do you see any issues with the container-toolkit injecting the necessary config files in this case?
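
The library-injection question above can be checked from inside the pod with a short script. The following is only a rough sketch, not an official diagnostic: the helper name caps_include_graphics is hypothetical, and the GLVND vendor directory shown is the common default and may differ on a given image. It relies on two things that are documented elsewhere: the NVIDIA container toolkit injects OpenGL/EGL libraries only when NVIDIA_DRIVER_CAPABILITIES includes "graphics" (or "all"), and VirtualGL's EGL back end is normally pointed at a DRI device such as /dev/dri/card0 (or the keyword egl) rather than at /dev/nvidia0.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a NVIDIA_DRIVER_CAPABILITIES value
# (comma-separated list) requests graphics (OpenGL/EGL) support.
caps_include_graphics() {
    case ",$1," in
        *,all,*|*,graphics,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report what the container toolkit was asked to inject into this container.
if caps_include_graphics "${NVIDIA_DRIVER_CAPABILITIES:-}"; then
    echo "graphics capability requested"
else
    echo "graphics capability NOT requested: ${NVIDIA_DRIVER_CAPABILITIES:-<unset>}"
fi

# List DRI device nodes; the EGL back end needs one of these to be visible.
ls -l /dev/dri 2>/dev/null || echo "no /dev/dri in this container"

# List GLVND EGL vendor files (assumed default location; may vary by image).
ls /usr/share/glvnd/egl_vendor.d 2>/dev/null || echo "no GLVND EGL vendor directory"

# Print (do not run) candidate vglrun commands for each DRI card node found.
for dev in /dev/dri/card*; do
    [ -e "$dev" ] && echo "candidate: vglrun -d $dev glxgears"
done
```

If the first check reports that graphics is not requested, setting NVIDIA_DRIVER_CAPABILITIES to a value that includes graphics in the pod spec is a reasonable first thing to try before looking further.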

Mohamed-ben-khemis commented 2 months ago

@shivamerla Here are the results from running kubectl get pods -n gpu-operator:

[image: output of kubectl get pods -n gpu-operator]