Closed DevinBayly closed 4 years ago
Please look up and understand what `VGL_DISPLAY`/`vglrun -d` actually does. It should never be set to the value of `$DISPLAY`.
Thanks, I'll go read that immediately.
It does sound like `vglrun -d` is for multiple GPUs and multiple X screens, so it doesn't relate to `$DISPLAY`. I understand that now.
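To make the distinction concrete, here is a minimal shell sketch. It assumes VirtualGL's documented default of `:0` for the 3D X server when `VGL_DISPLAY` is unset, which matches the `Could not open display :0` message quoted in this thread. The helper name `check_vgl_display` is hypothetical and not part of VirtualGL.

```shell
#!/bin/sh
# check_vgl_display (hypothetical helper, not part of VirtualGL)
#   $1 = intended VGL_DISPLAY / "vglrun -d" value (empty means the default, :0)
#   $2 = the 2D X display that $DISPLAY points to (e.g. an SSH-forwarded display)
check_vgl_display() {
    vgl_display="${1:-:0}"
    x_display="$2"
    if [ "$vgl_display" = "$x_display" ]; then
        echo "wrong: VGL_DISPLAY must name the 3D X server, not \$DISPLAY"
        return 1
    fi
    echo "ok: 3D rendering on $vgl_display, images sent to $x_display"
}

# Values from this thread: the first call is the mistaken setup,
# the second leaves VGL_DISPLAY at its default.
check_vgl_display "localhost:10.0" "localhost:10.0" || true
check_vgl_display "" "localhost:10.0"
```

The check only compares strings; it does not verify that the 3D X server is actually running or reachable.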
So is there a problem with something in our setup prior to the incorrect `vglrun -d`?
`vglrun glxinfo` on its own, using the initial configuration info and access steps from the first post, leads to
vglrun glxinfo
name of display: localhost:10.0
[VGL] ERROR: Could not open display :0.
Should I create a separate issue to pursue this problem further since this issue is closed?
Closing an issue just means that the issue was resolved or was not something that needs to be addressed in VirtualGL. Continuing to discuss the same topic in the comments is OK, even if the issue is closed.
That error message means that either the 3D X server isn't running or VirtualGL can't access it. Perform the "Sanity Check" procedure described here: https://cdn.rawgit.com/VirtualGL/virtualgl/2.6.4/doc/index.html#hd006002001
Some common reasons why the 3D X server may not be accessible:
- `vglserver_config` was never run on the VirtualGL server.
- [...] `vglserver_config` prompted you to disable Wayland in the display manager.
- [...] `vglserver_config` and never restarted.)
- [...] `vglusers` group, but your user account isn't in the `vglusers` group (or you did not log out and back in after adding your user account to the group.) Make sure you can read /etc/opt/VirtualGL/vgl_xauth_key.

It was the X server that wasn't in existence, as I've been told by folks in other places.
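Two of the access-related causes listed above (an unreadable xauth key and missing `vglusers` membership) can be checked from the shell before re-running the Sanity Check. The following is a rough sketch, not an official VirtualGL tool: the function names are made up for illustration, while the key path and group name are the ones mentioned in the list.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with VirtualGL).

# can_read_key FILE: report whether the current user can read the xauth key.
can_read_key() {
    if [ -r "$1" ]; then
        echo "readable: $1"
    else
        echo "not readable: $1"
        return 1
    fi
}

# in_group GROUP "LIST": succeed if GROUP appears in the space-separated LIST.
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

can_read_key /etc/opt/VirtualGL/vgl_xauth_key || true

# "id -nG" lists the groups of the current login session, so a group added
# after login only shows up once you log out and back in.
if in_group vglusers "$(id -nG)"; then
    echo "current session is in vglusers"
else
    echo "current session is not in vglusers"
fi
```

Neither check proves that the 3D X server is running; they only rule out two permission problems.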
Currently stumped by the process of removing the nvidia module, which causes this error:
... Attempting to remove nvidia module from memory so device permissions
will be reloaded ...
rmmod: ERROR: Module nvidia is in use by
but this isn't a VirtualGL issue.
Thanks for the clarification in your last reply either way!
That error just means you might need to reboot in order for the nVidia device permissions to be correct for shared use with VGL.
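For anyone who wants to see what is holding the module before rebooting: `rmmod` refuses to unload a module whose use count is non-zero, and `lsmod` shows that count along with any dependent kernel modules. The sketch below parses `lsmod`-style text; the helper name `module_users` is hypothetical.

```shell
#!/bin/sh
# module_users NAME (hypothetical helper): read lsmod-style text on stdin and
# print the use count and dependent modules for kernel module NAME.
# lsmod columns are: Module, Size, Used by (count, then comma-separated modules).
module_users() {
    awk -v m="$1" '$1 == m {
        printf "%s: use count %s, used by %s\n", m, $3, ($4 == "" ? "(no other modules)" : $4)
    }'
}

# Typical use on the VirtualGL server:
#   lsmod | module_users nvidia
```

A non-zero count with no dependent modules listed usually means that a process, such as a running X server, still has the device open.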
Oh excellent, I'll give that a try!
You wouldn't be surprised to see the error
vglrun -d /dev/dri/card0 glxinfo
[VGL] ERROR: in init3D--
[VGL] 219: Could not open EGL display
if all the permissions on /dev/dri/card0 and /dev/nvidia* were +rw, but the nVidia device permissions hadn't taken effect, right?
I’m not sure. In my testing, I didn’t notice that the DRI devices depended on the /dev/nvidia permissions at all, but it might be system-specific or GPU-specific. Regardless, if the issue persists after reboot, then we can look into it. The main thing with the EGL back end is that both /dev/dri/card and /dev/dri/render need to have correct permissions, and that’s why you need to run the version of vglserver_config in the pre-release build. The 2.6.x version of vglserver_config only sets permissions for /dev/dri/card.
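A quick way to see whether both kinds of device node exist and are accessible is sketched below. It assumes the usual Linux naming, where card nodes look like `card0` and render nodes look like `renderD128`; the function name `check_dri_nodes` is hypothetical and not part of VirtualGL.

```shell
#!/bin/sh
# check_dri_nodes DIR (hypothetical helper): list card* and renderD* nodes
# under DIR (normally /dev/dri) and report whether the current user can
# read and write each one.
check_dri_nodes() {
    dir="${1:-/dev/dri}"
    found_render=no
    for node in "$dir"/card* "$dir"/renderD*; do
        [ -e "$node" ] || continue      # skip glob patterns that matched nothing
        case "$node" in
            */renderD*) found_render=yes ;;
        esac
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "$node: read/write ok"
        else
            echo "$node: no read/write access"
        fi
    done
    [ "$found_render" = yes ] || echo "no render nodes found in $dir"
}

check_dri_nodes /dev/dri
```

If the last line of output reports that no render nodes were found, the permissions set by `vglserver_config` are not the limiting factor, because there is nothing for them to apply to.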
I'm working with the infrastructure team this afternoon to double check the reboot result, I'll let you know then. Thanks for the suggestion!
We will check the permissions on the /dev/dri/render* files also. The version of vglserver_config we used was what I found under the Linux packages on this page https://virtualgl.org/DeveloperInfo/PreReleases under the dev branch evolving 3.0 section. This directed to the S3 bucket, and we installed the VirtualGL-2.6.80.x86_64.rpm. Since this is still a 2.6.x version, is there somewhere else I should look for the pre-release build?
Sorry for the confusion. When I say 2.6.x, I mean 2.6.x stable. 2.6.80 is 3.0 alpha, which is not considered production-ready at the moment. It is an early access build.
No worries, glad we have the right version.
The infrastructure date got bumped to Monday, so I'll report back then. Thanks for the help!
I should mention I noticed something that I missed before: the only contents of `/dev/dri` is `card0`; there are no `render*` files. I would believe this is causing problems, but I have to research what they are or why we don't have any on the machine.
I just tried
vglrun +v -d /dev/dri/card0 glxgears
[VGL] Shared memory segment ID for vglconfig: 2
[VGL] VirtualGL v2.6.80 64-bit (Build 20200917)
[VGL] Opening EGL device /dev/dri/card0
[VGL] ERROR: in init3D--
[VGL] 219: Could not open EGL display
This is what I was seeing before when we weren't sure if the `vglserver_config` had worked, because we saw the error about the nvidia module not being removed properly.
Since then the infrastructure team has run the following
2 2020/09/25 14:36:19 rpm -ivh VirtualGL-2.6.80.x86_64.rpm
3 2020/09/25 14:37:09 vglserver_config
4 2020/09/25 14:37:34 reboot
but it appears the reboot wasn't the part that was missing. Should I make a separate issue for this? Thanks for all your work and assistance.
I don't know what a second issue would accomplish. I have no clue why there are no **/dev/render*** files on your machine. That isn't our bug, and you are asking for support on a feature (the EGL back end) that is not even in beta yet. If you want to pay me as a consultant to diagnose the problem, then I'm happy to do that, but my free support is limited to fixing confirmed bugs in VirtualGL.
Sorry, I didn't mean to suggest the second issue would have anything to do with the missing render files, and I agree that's not your bug. I mentioned it mostly just to see if info related to the EGL backend should stay in this issue thread. That said, I think there's more I need to look into and I will reply if I have something that falls better under the category of your free support. Take care!
Hello there. I have run into some similar problems. VGL on my server can only work without the nvidia hardware; it uses llvm. However, display port :1 can use the nvidia driver, but the port created by TurboVNC can only use the integrated video card. My system is Ubuntu 18.04.6 using gdm3, and the graphics card is 10 2080ti. The driver version is about 450. The VGL version is 2.6.5. When I follow the guidance, I can't find the file "vgl_xauth_key" anywhere. Would that be the key problem? Looking forward to your reply~
@crazyleeth Please do not hijack other issues, particularly issues that are closed and which may or may not be related to yours. Post a new issue.
Hi there,
Yesterday I posted a similar issue but was trying to use a singularity container and xpra to use VirtualGL with an nvidia card on the HPC where I work. Today I'm trying to strip away the complicating factors, so I'm no longer using any containers or xpra.
I'm still having trouble, and feel like some step is getting left out. Please let me know if anything obvious is missing.
With the infrastructure team we walked through the steps laid out on https://virtualgl.org/Documentation/HeadlessNV. We use a module system for loading the CUDA drivers (`module load cuda10.1`), and after that `nvidia-smi` produces output for the following card. Then, after following the headless mini instructions, `/etc/X11/xorg.conf` looks like the following.

We then configured the VirtualGL 3D X server following these steps https://cdn.rawgit.com/VirtualGL/virtualgl/2.6.3/doc/index.html#hd006002001 granting access to the 3D X server. We ran `init 3`, as this is a CentOS 7 machine, and then `/usr/bin/vglserver_config`.

We supplied the answer N to each of the config steps, and got this message at the end. We weren't sure what to make of the error, or how to get around it.

I then performed the multi jump as suggested here #15, where I added vglconnect, vgllogin and nettest to the gateway and the login node, and from my laptop client ran `vglconnect -s -bindir /home/u4/myuser`, then from the gateway to the login node, specifying the place where the login node vgl files were copied, `vglconnect -s -bindir /home/u4/myuser`, then connected to the active job where the `vglserver_config` happened and the `/etc/X11/xorg.conf` lives.

My `$DISPLAY` variable was `localhost:10.0`, so I tried `vglrun -d localhost:10.0 glxinfo` but still see

Your help is greatly appreciated!