VirtualGL / virtualgl

Main VirtualGL repository
https://VirtualGL.org

Nvidia hardware on HPC ignored when using VirtualGL #138

Closed DevinBayly closed 4 years ago

DevinBayly commented 4 years ago

Hi there,

Yesterday I posted a similar issue in which I was trying to use a Singularity container and xpra to run VirtualGL with an nVidia card on the HPC where I work. Today I'm trying to strip away the complicating factors, so I'm no longer using any containers or xpra.

I'm still having trouble, and feel like some step is getting left out. Please let me know if anything obvious is missing.

With the infrastructure team, we walked through the steps laid out on https://virtualgl.org/Documentation/HeadlessNV. We use a module system for loading the CUDA drivers (module load cuda10.1), and after that nvidia-smi produces the following output for the card:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   20C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
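One detail worth noting when building the xorg.conf below: nvidia-smi reports the PCI Bus-Id in hexadecimal (00000000:0B:00.0 above), while the BusID field in xorg.conf expects decimal values ("PCI:11:0:0"). A minimal shell sketch of the conversion, taking the value from the nvidia-smi output above:

```shell
# Convert nvidia-smi's hex Bus-Id into the decimal form xorg.conf expects.
busid_hex="00000000:0B:00.0"                       # as reported by nvidia-smi
IFS=':.' read -r _ bus dev fn <<< "$busid_hex"     # split domain:bus:device.function
printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"  # prints PCI:11:0:0
```

Hex 0B is decimal 11, which matches the BusID "PCI:11:0:0" in the Device section below.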

Then, after following the headless mini-HOWTO instructions, /etc/X11/xorg.conf looks like the following:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 440.64.00
Section "DRI"
    Mode 0666
EndSection

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla V100-PCIE-32GB"
    BusID          "PCI:11:0:0"
    Option         "HardDPMS" "false"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    SubSection     "Display"
        Virtual     1920 1200
        Depth       24
    EndSubSection
EndSection

We then configured the VirtualGL 3D X server, granting access to it by following these steps: https://cdn.rawgit.com/VirtualGL/virtualgl/2.6.3/doc/index.html#hd006002001. Since this is a CentOS 7 machine, we ran init 3 and then /usr/bin/vglserver_config. We answered N to each of the configuration prompts and got the following message at the end. We weren't sure what to make of the error or how to get around it.

... Modifying /etc/security/console.perms to disable automatic permissions
    for DRI devices ...
... Creating /etc/modprobe.d/virtualgl.conf to set requested permissions for
    /dev/nvidia* ...
... Attempting to remove nvidia module from memory so device permissions
    will be reloaded ...
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm
... Granting write permission to /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools for all users ...
... Granting write permission to /dev/dri/card0 for all users ...
... Modifying /etc/X11/xorg.conf.d/99-virtualgl-dri to enable DRI permissions
    for all users ...
... Modifying /etc/X11/xorg.conf to enable DRI permissions
    for all users ...
... Setting default run level to 5 (enabling graphical login prompt) ...
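Since the rmmod failure left it unclear whether the permissions were actually applied, a small helper like the following can verify that the device nodes named in the log above are world readable/writable after a reboot (check_rw is a hypothetical name; the paths are the ones vglserver_config reports granting access to):

```shell
# Check that each device node grants read+write to all users (world permission
# digit 6 or 7), which is what vglserver_config is attempting to set up above.
check_rw() {
    for f in "$@"; do
        [ -e "$f" ] || { echo "missing: $f"; continue; }
        perms=$(stat -c '%a' "$f")
        case "$perms" in
            *[67]) echo "ok:     $f ($perms)" ;;   # world has r+w
            *)     echo "not rw: $f ($perms)" ;;
        esac
    done
}
check_rw /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/dri/card0
```

Any "not rw" or "missing" line after a reboot would point at the permissions not having taken effect.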

I then performed the multi-hop connection suggested in #15: I added vglconnect, vgllogin, and nettest to the gateway and the login node. From my laptop client I ran vglconnect -s -bindir /home/u4/myuser, then from the gateway to the login node (specifying where the login node's vgl files were copied) I ran vglconnect -s -bindir /home/u4/myuser, and finally connected to the active job where vglserver_config was run and /etc/X11/xorg.conf lives.

My $DISPLAY variable was localhost:10.0, so I tried vglrun -d localhost:10.0 glxinfo but still see:

client glx vendor string: VirtualGL
client glx version string: 1.4
client glx extensions:
...
server glx vendor string: VirtualGL
server glx version string: 1.4
server glx extensions:
...

OpenGL vendor string: VMware, Inc.
OpenGL renderer string: llvmpipe (LLVM 7.0, 256 bits)
OpenGL version string: 2.1 Mesa 18.3.4
OpenGL shading language version string: 1.20
OpenGL extensions:

Your help is greatly appreciated!

dcommander commented 4 years ago

Please look up and understand what VGL_DISPLAY/vglrun -d actually does. It should never be set to the value of $DISPLAY.
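For readers following along, the distinction can be sketched as follows (display numbers here are illustrative; :0 is the headless 3D X server configured above, while localhost:10.0 is the forwarded 2D display being viewed). These commands require a running X session, so they are shown as a usage fragment only:

```shell
# DISPLAY names the 2D X server you are looking at (VNC/SSH-forwarded proxy);
# VGL_DISPLAY / vglrun -d names the 3D X server (or EGL device) that renders.
export DISPLAY=localhost:10.0   # where the application's window appears
vglrun -d :0 glxinfo            # render on the headless 3D X server :0
VGL_DISPLAY=:0 vglrun glxinfo   # equivalent, via the environment variable
```

Setting VGL_DISPLAY to the same value as DISPLAY defeats the purpose, since rendering would then happen on the proxy display instead of the GPU-backed X server.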

DevinBayly commented 4 years ago

Thanks, I'll go read that immediately.

DevinBayly commented 4 years ago

It does sound like vglrun -d is for machines with multiple GPUs and multiple X screens, so it doesn't relate to $DISPLAY; I understand that now.

So is there a problem with something in our setup prior to the incorrect vglrun -d?

vglrun glxinfo on its own, using the initial configuration and access steps from the first post, leads to:

vglrun glxinfo
name of display: localhost:10.0
[VGL] ERROR: Could not open display :0.

Should I create a separate issue to pursue this problem further since this issue is closed?

dcommander commented 4 years ago

Closing an issue just means that the issue was resolved or was not something that needs to be addressed in VirtualGL. Continuing to discuss the same topic in the comments is OK, even if the issue is closed.

That error message means that either the 3D X server isn't running or VirtualGL can't access it. Perform the "Sanity Check" procedure described here: https://cdn.rawgit.com/VirtualGL/virtualgl/2.6.4/doc/index.html#hd006002001

Some common reasons why the 3D X server may not be accessible:

  1. vglserver_config was never run on the VirtualGL server.
  2. The VirtualGL server supports Wayland, but you did not answer "yes" when vglserver_config prompted you to disable Wayland in the display manager.
  3. The 3D X server isn't actually running (perhaps due to an X server issue-- check /var/log/Xorg.0.log for errors-- or simply because it was stopped in order to run vglserver_config and never restarted.)
  4. You restricted 3D X server access to the vglusers group, but your user account isn't in the vglusers group (or you did not log out and back in after adding your user account to the group.) Make sure you can read /etc/opt/VirtualGL/vgl_xauth_key.
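The "Sanity Check" procedure referenced above amounts to roughly the following (a sketch based on the linked guide; paths assume a default VirtualGL install, and the xauth step applies only when access is restricted to the vglusers group). These commands need the running 3D X server, so they are shown as a usage fragment only:

```shell
# Run these on the VirtualGL server itself, not through vglrun.
xauth merge /etc/opt/VirtualGL/vgl_xauth_key   # only if access is restricted to vglusers
xdpyinfo -display :0                           # the 3D X server must respond
/opt/VirtualGL/bin/glxinfo -display :0 -c      # should list nVidia GLX visuals, not Mesa/llvmpipe
```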
DevinBayly commented 4 years ago

It was the X server that wasn't running, as I've been told by folks elsewhere.

Currently stumped by the process of removing the nvidia module causing this error

... Attempting to remove nvidia module from memory so device permissions
will be reloaded ...
rmmod: ERROR: Module nvidia is in use by

but this isn't a VirtualGL issue

Thanks for the clarification in your last reply either way!

dcommander commented 4 years ago

That error just means you might need to reboot in order for the nVidia device permissions to be correct for shared use with VGL.

DevinBayly commented 4 years ago

Oh excellent, I'll give that a try!

You wouldn't be surprised to see the error

vglrun -d /dev/dri/card0 glxinfo
[VGL] ERROR: in init3D--
[VGL]    219: Could not open EGL display

if all the permissions on /dev/dri/card0 and /dev/nvidia* were +rw, but the nVidia device permissions hadn't taken effect, right?

dcommander commented 4 years ago

I’m not sure. In my testing, I didn’t notice that the DRI devices depended on the /dev/nvidia permissions at all, but it might be system-specific or GPU-specific. Regardless, if the issue persists after reboot, then we can look into it. The main thing with the EGL back end is that both /dev/dri/card* and /dev/dri/render* need to have correct permissions, and that’s why you need to run the version of vglserver_config in the pre-release build. The 2.6.x version of vglserver_config only sets permissions for /dev/dri/card*.
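A quick way to see whether the render nodes exist at all on the server (the EGL back end needs a DRM render node, conventionally numbered from renderD128 up):

```shell
ls -l /dev/dri/
# A working setup typically shows both card and render nodes, for example:
#   crw-rw-rw- 1 root video  226,   0 ... card0
#   crw-rw-rw- 1 root render 226, 128 ... renderD128
# If no renderD* entries appear, the kernel DRM driver did not create render nodes.
```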

DevinBayly commented 4 years ago

I'm working with the infrastructure team this afternoon to double check the reboot result, I'll let you know then. Thanks for the suggestion!

We will also check the permissions on the /dev/dri/render* files. The version of vglserver_config we used was the one I found under the Linux packages on this page, https://virtualgl.org/DeveloperInfo/PreReleases, under the "dev branch (evolving 3.0)" section. That directed us to the S3 bucket, and we installed VirtualGL-2.6.80.x86_64.rpm. Since this is still a 2.6.x version, is there somewhere else I should look for the pre-release build?

dcommander commented 4 years ago

Sorry for the confusion. When I say 2.6.x, I mean 2.6.x stable. 2.6.80 is 3.0 alpha, which is not considered production-ready at the moment. It is an early access build.

DevinBayly commented 4 years ago

No worries, glad we have the right version.

The infrastructure date got bumped to Monday, so I'll report back then. Thanks for the help!

DevinBayly commented 4 years ago

I should mention I noticed something that I missed before: the only content of /dev/dri is card0; there are no render* files. I believe this is causing problems, but I have to research what they are and why we don't have any on this machine.

I just tried

vglrun +v -d /dev/dri/card0 glxgears
[VGL] Shared memory segment ID for vglconfig: 2
[VGL] VirtualGL v2.6.80 64-bit (Build 20200917)
[VGL] Opening EGL device /dev/dri/card0
[VGL] ERROR: in init3D--
[VGL]    219: Could not open EGL display

This is what I was seeing before when we weren't sure if the vglserver_config had worked because we saw the error about the nvidia module not being removed properly.

Since then the infrastructure team has run the following

2  2020/09/25 14:36:19 rpm -ivh VirtualGL-2.6.80.x86_64.rpm
3  2020/09/25 14:37:09 vglserver_config
4  2020/09/25 14:37:34 reboot

but it appears the reboot wasn't the part that was missing. Should I make a separate issue for this? Thanks for all your work and assistance.

dcommander commented 4 years ago

I don't know what a second issue would accomplish. I have no clue why there are no /dev/dri/render* files on your machine. That isn't our bug, and you are asking for support on a feature (the EGL back end) that is not even in beta yet. If you want to pay me as a consultant to diagnose the problem, then I'm happy to do that, but my free support is limited to fixing confirmed bugs in VirtualGL.

DevinBayly commented 4 years ago

Sorry, I didn't mean to suggest the second issue would have anything to do with the missing render files, and I agree that's not your bug. I mentioned it mostly just to see if info related to the EGL backend should stay in this issue thread. That said, I think there's more I need to look into and I will reply if I have something that falls better under the category of your free support. Take care!

crazyleeth commented 3 years ago

Hello there. I'm hitting some similar problems. VirtualGL on my server can only work without the nVidia hardware; it uses that llvmpipe renderer. However, display :1 can use the nvidia driver, while the display created by TurboVNC can only use the integrated video card. My system is Ubuntu 18.04.6 using gdm3, the graphics card is 10 2080ti, and the driver version is about 450. The VGL version is 2.6.5. When I follow the guidance, I can't find the file "vgl_xauth_key" anywhere. Could that be the key problem? Looking forward to your reply~

dcommander commented 3 years ago

@crazyleeth Please do not hijack other issues, particularly issues that are closed and which may or may not be related to yours. Post a new issue.