Closed by mcuma 2 years ago
I will look into it. There are several known issues with the EGL back end, and it has been difficult to track them down, since most of them affect commercial applications that I have no access to. This issue fortunately affects an application that is publicly available.
Great, thanks for the quick reply. This should be fairly easy to reproduce: download the binary (which comes with a Java runtime) and run it as "runIDV".
I should have said, run as "vglrun runIDV".
Weird. I can't reproduce the crash. But it is interesting to note that another user experienced a crash with Rocky Linux, TecPlot360, and the EGL back end that I was also unable to reproduce, even on CentOS Stream with the same nVidia driver version. I am working with that user to diagnose the crash, so it's possible that this is the same issue. Maybe I need to switch my test machine to Rocky Linux.
Interesting to hear that it works for you. What OS do you run where it works? I'll try that with a container to see if that works for us. And what is your Nvidia driver version? Ours is 510.39.01. There were some older IDV discussion board messages talking about GPU driver issues with similar error messages, so it is a possibility.
I converted my CentOS Stream box to Rocky Linux and made sure all of the packages are synchronized with the latest in the Rocky Linux 8.5 repositories. Unfortunately, I still cannot reproduce the failure. This specific box has a Quadro P620 with driver version 510.60.02.
The libEGL warnings make me suspicious. The first thing I would try is re-installing the nVidia drivers. Perhaps a system upgrade overwrote nVidia's proprietary libEGL implementation with the Mesa implementation.
Thanks, it looks like you are on a slightly newer GPU driver. Would you mind running "ls -la /usr/lib64/libEGL*" on your system so we can compare that to the libs that we have? We'll be looking into reinstalling the drivers as well. Thanks.
lrwxrwxrwx. 1 root root 20 Nov 9 15:43 /usr/lib64/libEGL_mesa.so.0 -> libEGL_mesa.so.0.0.0
-rwxr-xr-x. 1 root root 269080 Nov 9 15:44 /usr/lib64/libEGL_mesa.so.0.0.0
lrwxrwxrwx. 1 root root 26 Mar 31 09:58 /usr/lib64/libEGL_nvidia.so.0 -> libEGL_nvidia.so.510.60.02
-rwxr-xr-x. 1 root root 1316880 Mar 31 09:58 /usr/lib64/libEGL_nvidia.so.510.60.02
lrwxrwxrwx. 1 root root 15 May 18 2021 /usr/lib64/libEGL.so -> libEGL.so.1.1.0
lrwxrwxrwx. 1 root root 15 May 18 2021 /usr/lib64/libEGL.so.1 -> libEGL.so.1.1.0
-rwxr-xr-x. 1 root root 84760 May 18 2021 /usr/lib64/libEGL.so.1.1.0
Actually, now that I think about it, this distribution uses libglvnd, so a system upgrade shouldn't overwrite the nVidia-installed libEGL implementation. Still, though, it seems like your system might have an issue with its OpenGL libraries.
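One way to see which EGL implementations libglvnd can dispatch to is to list the vendor JSON files it reads. The sketch below assumes the usual libglvnd default directories (/etc/glvnd/egl_vendor.d and /usr/share/glvnd/egl_vendor.d); a distribution may be configured differently, and the function name is only illustrative.

```shell
#!/bin/sh
# List the EGL vendor JSON files found in the given directories.
# The directories passed at the bottom are libglvnd's usual defaults,
# which is an assumption about this particular system.
list_egl_vendors() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue          # skip directories that do not exist
        for f in "$dir"/*.json; do
            [ -e "$f" ] || continue        # skip the unexpanded glob if empty
            printf '%s\n' "$f"
        done
    done
}

list_egl_vendors /etc/glvnd/egl_vendor.d /usr/share/glvnd/egl_vendor.d
```

If no nVidia entry shows up in that list, the driver's EGL vendor file is probably missing, which would be consistent with a broken driver installation.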
I am working with Martin on this. I don't think we have had the EGL libs get overwritten from what I can see.
ls -la /usr/lib64/libEGL*
lrwxrwxrwx 1 root root 20 Nov 9 14:43 /usr/lib64/libEGL_mesa.so.0 -> libEGL_mesa.so.0.0.0
-rwxr-xr-x 1 root root 269080 Nov 9 14:44 /usr/lib64/libEGL_mesa.so.0.0.0
lrwxrwxrwx 1 root root 26 Jan 24 20:25 /usr/lib64/libEGL_nvidia.so.0 -> libEGL_nvidia.so.510.47.03
-rwxr-xr-x 1 root root 1316880 Jan 24 15:49 /usr/lib64/libEGL_nvidia.so.510.47.03
lrwxrwxrwx 1 root root 15 May 18 2021 /usr/lib64/libEGL.so -> libEGL.so.1.1.0
lrwxrwxrwx 1 root root 15 May 18 2021 /usr/lib64/libEGL.so.1 -> libEGL.so.1.1.0
-rwxr-xr-x 1 root root 84760 May 18 2021 /usr/lib64/libEGL.so.1.1.0
Also, your mention of driver 510.60.02 suggests you're getting your driver from somewhere other than the nvidia cuda yum repo, which is where we pull cuda and the driver from. The latest version in there is 510.47.03, and that is what we are on. Following your suggestion, I am reinstalling the nvidia cuda and driver stack on a test system for Martin to try, just to be sure.
I am getting the driver directly from nvidia.com. I do not use Cuda.
I am as well, from nvidia, just from a yum repo:
cat cuda.repo
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/7fa2af80.pub
obsoletes=0
rpmfusion has 510.60.02; nvidia cuda has 510.47.03.
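For comparing machines quickly, the driver version can be read from the file that libEGL_nvidia.so.0 points to, as in the listings above. This is a minimal sketch: the function name is made up for illustration, and it assumes the library lives in /usr/lib64 and follows the libEGL_nvidia.so.VERSION naming seen in this thread.

```shell
#!/bin/sh
# Print the driver version encoded in the target of libEGL_nvidia.so.0.
# lib_dir defaults to /usr/lib64, the directory used in the listings above.
nvidia_egl_version() {
    lib_dir=${1:-/usr/lib64}
    target=$(readlink "$lib_dir/libEGL_nvidia.so.0") || return 1
    target=${target##*/}                         # drop any directory part
    printf '%s\n' "${target#libEGL_nvidia.so.}"  # drop the fixed prefix
}

nvidia_egl_version || echo "libEGL_nvidia.so.0 not found" >&2
```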
Does the Cuda driver differ from the plain GPU driver? Specifically, here is the one I installed: https://us.download.nvidia.com/XFree86/Linux-x86_64/510.60.02/NVIDIA-Linux-x86_64-510.60.02.run
I tried the Cuda driver you linked to above. I still cannot reproduce the issue.
Thanks for the info. It's hard to debug if you can't reproduce it. We do have a workaround: using both the OpenGL and the EGL on the machines that run X. So I am tempted to just leave this alone for now.
Can you explain more about the workaround? I’m not sure what you mean.
Brian may explain this better, but from what I understand, he installed both the OpenGL and EGL into the VirtualGL. So if I run "vglrun -d :0.0 -c proxy", it uses OpenGL on a system that is running X, and if I run "vglrun -c proxy", it uses EGL. We use the former to run IDV on systems that run X, and we don't use VirtualGL on systems that don't run X (our cluster compute nodes). That's good enough, since the compute nodes don't have good graphics cards anyway. The hope, of course, was to use the same setup on all systems, but a relatively simple condition in a launch script separates the two cases.
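The launch-script condition could look roughly like the sketch below. It is not our actual script: the function names are made up, and checking for the X11 socket of display :0 at /tmp/.X11-unix/X0 is only an assumed way to decide whether a machine runs X. The function prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: treat the presence of the X11 socket for display :0
# as "this machine runs X". The default socket path is an assumption.
have_x_server() {
    [ -S "${1:-/tmp/.X11-unix/X0}" ]
}

# Print the command line for the two cases described above.
idv_launch_command() {
    if have_x_server "$1"; then
        # X is running: use VirtualGL with the 3D X server at :0.0
        echo "vglrun -d :0.0 -c proxy runIDV"
    else
        # No X server (e.g. compute nodes): run IDV without VirtualGL
        echo "runIDV"
    fi
}

idv_launch_command
```

A wrapper script would then execute the printed command instead of echoing it.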
Your terminology is confusing. I don't understand what "he installed both the OpenGL and EGL into the VirtualGL" means. Are you referring to the GLX and EGL back ends? You seem to be suggesting that your system is configured to use the EGL back end by default but that you are continuing to use the GLX back end only for IDV. If so, then unfortunately that doesn't do anything to help me solve the actual problem. :(
Yes, I meant GLX, sorry about the mixup. Using GLX is a workaround, obviously, but it enables us to run IDV. Given that you can't reproduce the problem, the only thing I can think of to help solve this is to give you access to our systems, unless Brian has some other idea.
I am more than happy to diagnose the issue remotely via SSH.
OK, thanks, I appreciate your willingness to do this. Let me discuss this with our leadership, and I'll get back to you next week. As you may imagine, access to an HPC resource like ours comes with some security requirements.
Hi, thanks again for your willingness to diagnose this issue on our systems. To get you access, we need some personal information. Can you please send a message to my work e-mail, m.cuma at utah.edu? Thanks.
This was apparently already fixed in 2bcdb66e930ac8785f363e2ae1fed054047e88da. Please try installing the latest 3.0.x pre-release: https://virtualgl.org/DeveloperInfo/PreReleases.
Hello,
we have been running IDV (Integrated Data Viewer), https://www.unidata.ucar.edu/software/idv/, with VirtualGL over GLX successfully for a while.
Now we are setting up a new system running Rocky Linux 8 with EGL, and IDV is crashing at startup. We are wondering whether this is due to an incomplete implementation of the EGL back end or something else, and, if the former, whether it could be fixed in the foreseeable future.
The crash is such that the IDV splash window shows OK, followed by a short (a second) view of the main window, followed by a crash with the following message:
Other than that, thanks for providing great software.