IDV crash with EGL

mcuma commented 2 years ago

Hello,

we have been running IDV (Integrated Data Viewer), https://www.unidata.ucar.edu/software/idv/, with VirtualGL over GLX successfully for a while.

Now we are setting up a new system running Rocky Linux 8 with EGL, and IDV is crashing at startup. We are wondering whether this is due to an incomplete EGL implementation or something else and, if the former, whether it could be fixed in the foreseeable future.

The IDV splash window shows fine, followed by a brief (about one second) view of the main window, and then IDV crashes with the following message:

libEGL warning: DRI3: failed to query the version
libEGL warning: DRI2: failed to authenticate
com.jogamp.opengl.GLException: J3D-Renderer-1: createImpl ARB n/a but required, profile > GL2 requested (OpenGL >= 3.1). Requested: GLProfile[GL4bc/GL4bc.hw], current: 4.6 (Compat profile, compat[ES2, ES3, ES31, ES32], FBO, hardware) - 4.6.0 NVIDIA 510.39.01
    at jogamp.opengl.x11.glx.X11GLXContext.createImpl(X11GLXContext.java:440)
    at jogamp.opengl.GLContextImpl.makeCurrentWithinLock(GLContextImpl.java:770)
    at jogamp.opengl.GLContextImpl.makeCurrent(GLContextImpl.java:653)
    at jogamp.opengl.GLContextImpl.makeCurrent(GLContextImpl.java:591)
    at javax.media.j3d.JoglPipeline$QueryCanvas.doQuery(JoglPipeline.java:8623)
    at javax.media.j3d.JoglPipeline$QueryCanvas.access$100(JoglPipeline.java:8574)
    at javax.media.j3d.JoglPipeline.createQueryContext(JoglPipeline.java:6561)
    at javax.media.j3d.Canvas3D.createQueryContext(Canvas3D.java:4619)
    at javax.media.j3d.Canvas3D.createQueryContext(Canvas3D.java:3616)
    at javax.media.j3d.Renderer.doWork(Renderer.java:461)
    at javax.media.j3d.J3dThread.run(J3dThread.java:271)

DefaultRenderingErrorListener.errorOccurred:
CONTEXT_CREATION_ERROR: Renderer: Error creating Canvas3D graphics context for queryProperties()
graphicsDevice = X11GraphicsDevice[screen=0]
canvas = javax.media.j3d.Canvas3D[canvas0,0,0,0x0,invalid]
This version of Java3D can't query "textureWidthMax/textureHeightMax"
so they are being assigned the default values: 
textureWidthMax:  1024
textureHeightMax:  1024
If images render as a 'grey-box', try setting these parameters
to a lower value, eg. 512, with '-DtextureWidthMax=512'
Otherwise check your graphics environment specifications
X11Util.Display: Shutdown (JVM shutdown: true, open (no close attempt): 2/2, reusable (open, marked uncloseable): 0, pending (open in creation order): 2)
X11Util: Open X11 Display Connections: 2
X11Util: Open[0]: NamedX11Display[10.242.129.51:14.0, 0x7f39f000ee50, refCount 1, unCloseable false]
X11Util: Open[1]: NamedX11Display[10.242.129.51:14.0, 0x7f39f05d5de0, refCount 1, unCloseable false]

Other than that, thanks for providing great software.

dcommander commented 2 years ago

I will look into it. There are several known issues with the EGL back end, and it has been difficult to track them down, since most of them affect commercial applications that I have no access to. This issue fortunately affects an application that is publicly available.

mcuma commented 2 years ago

Great, thanks for the quick reply. This should be fairly easy to reproduce: download the binary (which comes with a Java runtime) and run it as "runIDV".

mcuma commented 2 years ago

I should have said, run as "vglrun runIDV".

dcommander commented 2 years ago

Weird. I can't reproduce the crash. But it is interesting to note that another user experienced a crash with Rocky Linux, TecPlot360, and the EGL back end that I was also unable to reproduce, even on CentOS Stream with the same nVidia driver version. I am working with that user to diagnose the crash, so it's possible that this is the same issue. Maybe I need to switch my test machine to Rocky Linux.

mcuma commented 2 years ago

Interesting to hear that it works for you. What OS do you run where it works? I'll try that with a container to see if that works for us. And what is your Nvidia driver version? Ours is 510.39.01. There were some older IDV discussion board messages talking about GPU driver issues with similar error messages, so it is a possibility.

dcommander commented 2 years ago

I converted my CentOS Stream box to Rocky Linux and made sure all of the packages are synchronized with the latest in the Rocky Linux 8.5 repositories. Unfortunately, I still cannot reproduce the failure. This specific box has a Quadro P620 with driver version 510.60.02.

The libEGL warnings make me suspicious. The first thing I would try is re-installing the nVidia drivers. Perhaps a system upgrade overwrote nVidia's proprietary libEGL implementation with the Mesa implementation.

mcuma commented 2 years ago

Thanks, it looks like you are on a slightly newer GPU driver. Would you mind running "ls -la /usr/lib64/libEGL*" on your system so we can compare that to the libs that we have? We'll be looking into reinstalling the drivers as well. Thanks.

dcommander commented 2 years ago

lrwxrwxrwx. 1 root root      20 Nov  9 15:43 /usr/lib64/libEGL_mesa.so.0 -> libEGL_mesa.so.0.0.0
-rwxr-xr-x. 1 root root  269080 Nov  9 15:44 /usr/lib64/libEGL_mesa.so.0.0.0
lrwxrwxrwx. 1 root root      26 Mar 31 09:58 /usr/lib64/libEGL_nvidia.so.0 -> libEGL_nvidia.so.510.60.02
-rwxr-xr-x. 1 root root 1316880 Mar 31 09:58 /usr/lib64/libEGL_nvidia.so.510.60.02
lrwxrwxrwx. 1 root root      15 May 18  2021 /usr/lib64/libEGL.so -> libEGL.so.1.1.0
lrwxrwxrwx. 1 root root      15 May 18  2021 /usr/lib64/libEGL.so.1 -> libEGL.so.1.1.0
-rwxr-xr-x. 1 root root   84760 May 18  2021 /usr/lib64/libEGL.so.1.1.0

dcommander commented 2 years ago

Actually, now that I think about it, this distribution uses libglvnd, so a system upgrade shouldn't overwrite the nVidia-installed libEGL implementation. Still, though, it seems like your system might have an issue with its OpenGL libraries.

bdhaymore commented 2 years ago

I am working with Martin on this. I don't think we have had the EGL libs get overwritten from what I can see.

ls -la /usr/lib64/libEGL*
lrwxrwxrwx 1 root root      20 Nov  9 14:43 /usr/lib64/libEGL_mesa.so.0 -> libEGL_mesa.so.0.0.0
-rwxr-xr-x 1 root root  269080 Nov  9 14:44 /usr/lib64/libEGL_mesa.so.0.0.0
lrwxrwxrwx 1 root root      26 Jan 24 20:25 /usr/lib64/libEGL_nvidia.so.0 -> libEGL_nvidia.so.510.47.03
-rwxr-xr-x 1 root root 1316880 Jan 24 15:49 /usr/lib64/libEGL_nvidia.so.510.47.03
lrwxrwxrwx 1 root root      15 May 18  2021 /usr/lib64/libEGL.so -> libEGL.so.1.1.0
lrwxrwxrwx 1 root root      15 May 18  2021 /usr/lib64/libEGL.so.1 -> libEGL.so.1.1.0
-rwxr-xr-x 1 root root   84760 May 18  2021 /usr/lib64/libEGL.so.1.1.0

Also, your note of driver 510.60.02 suggests you're getting your driver from somewhere other than the nvidia cuda yum repo, which is where we are pulling cuda and the driver from. The latest version in there is 510.47.03, and that is what we are on. Following your suggestion, I am running a reinstall of the nvidia cuda and driver stack on a test system for Martin to try, just to be sure.
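For comparing the two listings above, the driver version can be read off the versioned libEGL_nvidia file name. The following is only a minimal sketch: the helper name `nvidia_egl_version` is hypothetical, and it assumes the file naming seen in the listings (`libEGL_nvidia.so.<driver version>`).

```python
import re
from typing import Optional, Tuple

# Assumption: the versioned library is named libEGL_nvidia.so.<major>.<minor>[.<patch>],
# as in the listings in this thread. The ".so.0" symlink name is deliberately not matched.
_PATTERN = re.compile(r"libEGL_nvidia\.so\.(\d+(?:\.\d+)+)$")


def nvidia_egl_version(path: str) -> Optional[Tuple[int, ...]]:
    """Return the driver version encoded in a libEGL_nvidia file name, or None."""
    match = _PATTERN.search(path)
    if match is None:
        return None
    # Convert "510.47.03" into (510, 47, 3) so versions compare numerically.
    return tuple(int(part) for part in match.group(1).split("."))


ours = nvidia_egl_version("/usr/lib64/libEGL_nvidia.so.510.47.03")
theirs = nvidia_egl_version("/usr/lib64/libEGL_nvidia.so.510.60.02")
print(ours, theirs, ours < theirs)  # (510, 47, 3) (510, 60, 2) True
```

On a live system, the argument could be obtained with `os.path.realpath("/usr/lib64/libEGL_nvidia.so.0")`, which resolves the symlink to the versioned file.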

dcommander commented 2 years ago

I am getting the driver directly from nvidia.com. I do not use Cuda.

bdhaymore commented 2 years ago

I am as well, from nvidia, just from a yum repo:

cat cuda.repo
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/7fa2af80.pub
obsoletes=0

karlkleinpaste commented 2 years ago

rpmfusion has 510.60.02; nvidia cuda has 510.47.03.

dcommander commented 2 years ago

Does the Cuda driver differ from the plain GPU driver? Specifically, here is the one I installed: https://us.download.nvidia.com/XFree86/Linux-x86_64/510.60.02/NVIDIA-Linux-x86_64-510.60.02.run

dcommander commented 2 years ago

I tried the Cuda driver you linked to above. I still cannot reproduce the issue.

mcuma commented 2 years ago

Thanks for the info. It's hard to debug if you can't reproduce it. We do have a workaround, by using both the OpenGL and the EGL on the machines that run X, so, I am tempted to just leave this alone for now.

dcommander commented 2 years ago

Can you explain more about the workaround? I’m not sure what you mean.

mcuma commented 2 years ago

Brian may explain this better, but, from what I understand, he installed both the OpenGL and EGL into the VirtualGL. So, if I run "vglrun -d :0.0 -c proxy", it'll use OpenGL on a system that's running X, and if I run "vglrun -c proxy", it'll use EGL. So, we use the former to run IDV on X-enabled systems only, and don't use VirtualGL on systems that don't run X (our cluster compute nodes). That's good enough, since the compute nodes don't have any good graphics cards anyway. The hope, of course, was to use the same thing for all the systems, but it's a relatively simple condition in a launch script to separate these two cases.
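The launch-script condition mentioned above might look roughly like the sketch below. It is not the actual script from this site: the function names are hypothetical, and detecting a local X server by looking for the `/tmp/.X11-unix/X0` socket is an assumption. The `vglrun -d :0.0 -c proxy` arguments are the ones quoted in this comment.

```python
import os
from typing import List

# Assumption: a local X server on display :0 exposes this Unix socket.
X0_SOCKET = "/tmp/.X11-unix/X0"


def has_local_x_server(socket_path: str = X0_SOCKET) -> bool:
    """Guess whether an X server is running on this host (heuristic only)."""
    return os.path.exists(socket_path)


def idv_command(x_running: bool, launcher: str = "runIDV") -> List[str]:
    """Build the command line for starting IDV on this host."""
    if x_running:
        # X-enabled system: point VirtualGL at display :0.0 (GLX).
        return ["vglrun", "-d", ":0.0", "-c", "proxy", launcher]
    # No X server (e.g. cluster compute nodes): run IDV without VirtualGL.
    return [launcher]


print(idv_command(True))   # ['vglrun', '-d', ':0.0', '-c', 'proxy', 'runIDV']
print(idv_command(False))  # ['runIDV']
```

A wrapper could then pass `idv_command(has_local_x_server())` to `subprocess.run`.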

dcommander commented 2 years ago

Your terminology is confusing. I don't understand what "he installed both the OpenGL and EGL into the VirtualGL" means. Are you referring to the GLX and EGL back ends? You seem to be suggesting that your system is configured to use the EGL back end by default but that you are continuing to use the GLX back end only for IDV. If so, then unfortunately that doesn't do anything to help me solve the actual problem. :(

mcuma commented 2 years ago

Yes, I meant GLX, sorry about the mixup. Using GLX is a workaround, obviously, but it enables us to run IDV. Given that you can't reproduce the problem, the only thing I can think of to help with solving this is to give you access to our systems, unless Brian has some other idea.

dcommander commented 2 years ago

I am more than happy to diagnose the issue remotely via SSH.

mcuma commented 2 years ago

OK, thanks, I appreciate your willingness to do this. Let me discuss this with our leadership and I'll get back to you next week. As you may imagine, access to an HPC resource like ours has some associated security.

mcuma commented 2 years ago

Hi, thanks again for your willingness to diagnose this issue on our systems. To get you access, we need some personal information. Can you please send a message to my work e-mail, m.cuma at utah.edu? Thanks.

dcommander commented 2 years ago

This was apparently already fixed in 2bcdb66e930ac8785f363e2ae1fed054047e88da. Please try installing the latest 3.0.x pre-release: https://virtualgl.org/DeveloperInfo/PreReleases.

VirtualGL / virtualgl

IDV crash with EGL #194