Closed: peci1 closed this issue 1 year ago.
Why it broke in Fortress/Ogre 2.2, I do not know.
What I can tell you is that when "currentGLContext" = "1" (which is what Gazebo does in both Edifice and Fortress), both Ogre 2.1 and 2.2 will perform the following:
```cpp
if( "currentGLContext" == "1" )
{
    if( !glXGetCurrentContext() )
    {
        OGRE_EXCEPT( Exception::ERR_RENDERINGAPI_ERROR,
                     "currentGLContext was specified with no current GL context",
                     "GLXWindow::create" );
    }
}
```
glXGetCurrentContext is a GLX call.
The algorithm is quite simple: if "currentGLContext" is set, then we expect a GL context to be bound (i.e. someone already called glXMakeContextCurrent or glXMakeCurrent).
Given that the context is created by Qt when using GUI, I suspect the cause is somewhere in Qt code, or some subtle change in Ogre2RenderEngine::CreateRenderWindow (or parent caller, or code executed up until now) that no longer sets the context.
Or alternatively, a routine caused a call to glXMakeCurrent( display, None, None ); which unsets the context.
I'd put a breakpoint in void GLXContext::endCurrent() { glXMakeCurrent( mGLSupport->getGLDisplay(), None, None ); } in RenderSystems/GL3Plus/src/windowing/GLX/OgreGLXContext.cpp to be sure it's not on the Ogre side.
TL;DR there is a missing call to glXMakeCurrent that must happen before Ogre is created, or an extra glXMakeCurrent( display, None, None ); that shouldn't be there.
But given that this only happens with VGL (instead of always) I have no idea why that would happen.
Perhaps Qt no longer uses glXMakeCurrent and uses something else that can still be retrieved with glXGetCurrentContext but broke with VGL.
Here is a gdb trace of just the GUI printing every call of glXMakeCurrent (without VGL):
And the same kind of trace with VGL:
If I found the right Qt source, it should be this one handling the context: https://code.woboq.org/qt5/qtbase/src/plugins/platforms/xcb/gl_integrations/xcb_glx/qglxintegration.cpp.html . It seems it is still full of glXMakeCurrent() calls...
I looked into this a little and was able to reproduce this issue using the container suggested in https://github.com/gazebosim/gz-sim/issues/1746#issuecomment-1264614220. For me, I noticed that the GL context was lost (glXGetCurrentContext returns null) after loading the OGRE GL3Plus plugin. Specifically it happens after the plugin tried to query EGL support. If I comment out this line then I no longer get the crash:
https://github.com/OGRECave/ogre-next/blob/v2-2/RenderSystems/GL3Plus/src/windowing/OgreGlSwitchableSupport.cpp#L69
Similarly if I build ogre-next from source without EGL support then gazebo works fine.
Mmm, EglPBufferSupport::initDevice will start querying all the GPUs and try to create them, in order to see which ones are compatible, including calls to eglMakeCurrent (which may override those from glXMakeCurrent).
Perhaps if Gazebo manually saves the GL context before loading the GL3+ plugin, and restores the context after loading it (and before creating a window) this problem can be fixed?
I don't know why the problem is specific to VirtualGL though.
From https://registry.khronos.org/EGL/sdk/docs/man/html/eglMakeCurrent.xhtml:
For purposes of eglMakeCurrent, the client API type of all OpenGL ES and OpenGL contexts is considered the same. In other words, if any OpenGL ES context is currently bound and context is an OpenGL context, or if any OpenGL context is currently bound and context is an OpenGL ES context, the currently bound context will be made no longer current and context will be made current.
So it makes sense that the EglPBufferSupport::initDevice() calls reset the GLX context.
I think it now makes sense to me.
When running without VirtualGL, the GLX context is created, and then several EGL contexts are probed, each with its own display.
When running with VirtualGL, the GLX context is emulated with eglMakeCurrent(), and then further EGL contexts are probed, also using eglMakeCurrent(). One of these contexts' displays, however, matches the emulated GLX display, which results in the observed error.
I don't think there's anything that could be fixed on the VirtualGL side. It just can't know if you want to actually change the current EGL context, or if it should remain as it was for the GLX emulation. Or, it would have to report the EGL display used for emulation as unavailable for direct EGL operation. I also asked about this on the VirtualGL repo.
I've written an MWE where you can easily test this behavior. The two lines with comment // FIX suggest how to fix this problem (they would normally go around loadPlugins()). I've tested Gazebo 6 with this fix and it worked!
Compile with g++ -o mwe mwe.cpp -lGL -lGLU -lX11 -lEGL .
cool, thanks for digging into this! Do you mind submitting a PR with your fix?
I found this issue while trying to run Gazebo Garden under VirtualGL inside a container. I don't mean to derail the conversation here but I have an observation:
Gazebo always crashes as described when VirtualGL is run with the EGL backend, but I have gotten it to run to a limited extent with the GLX backend. Some environments work OK (like multi_lrauv_race.sdf), and some cause an immediate crash (like tethys_at_empty_environment.sdf from the osrf/lrauv project). The crashing condition was filed as https://github.com/gazebosim/gz-sim/issues/1746 and duped here. Are these really the same issue?
(I don't know enough about VirtualGL or the graphics stack to know the difference between the backends or which one should be preferred. I'd be happy if either of them worked fully.)
I don't know why the different worlds from the lrauv project result in different behavior. Just looking at the ogre log in https://github.com/gazebosim/gz-sim/issues/1746, it has the same errors about a missing GL context as the ones reported here, so it could be due to the same issue.
Here is a DEB I've built from the main branch of virtualgl with the patch from https://github.com/VirtualGL/virtualgl/issues/220#issuecomment-1358529845: virtualgl_3.0.91_amd64-fixed.zip. With this patched version of VirtualGL, the crashes no longer occur for me.
However, following the discussion in the VirtualGL issue, I still think it'd make sense to also implement the fix on the Gazebo side: it seems that the current procedure working was more likely an accident than by design. Mixing GLX and EGL calls in a single program is something nobody does, and we haven't found any references saying how the interaction between the two should work. I'll prepare the Gazebo PR.
I cherry-picked this change to the gz-rendering7 branch, then ran Gazebo Garden with @peci1's build of VirtualGL using the EGL backend, with the multi_lrauv_race.sdf world.
Indeed I no longer get the error mentioning GLXWindow::create, but it crashes reliably. However the GLX backend appears to load worlds I couldn't before (maybe a red herring).
Actually, the PR to gz-rendering itself should be sufficient to fix the rendering issues. The patched VirtualGL binary was another approach to fix the issue "on the other end". Could you try with the non-patched VirtualGL?
I'll close this issue as the original one has been resolved. Let's continue this discussion in https://github.com/gazebosim/gz-sim/issues/1746 or a new issue. You can try specifying the VirtualGL device as egl0, as apparently /dev/dri/card0 has two ways of being accessed via EGL, only one of which actually works.
The VirtualGL-side fix has also been implemented: https://github.com/VirtualGL/virtualgl/issues/220 . So now this bug should be prevented from both sides :) I did a thorough compatibility test and all combinations I used worked.
Environment
~/.ignition/rendering
Description
Steps to reproduce
```shell
sudo chmod g+rw /dev/dri/card0
vglrun +v -d /dev/dri/card0 ign gazebo -v4
```
Output
VirtualGL is a nice way to run Gazebo in many constrained environments where e.g. a fully-fledged X server cannot be run, or it is running on the wrong GPU, etc. We've been using it all the time in the SubT challenge with Dome without problems. The above example works correctly in Dome and Edifice and fails in Fortress. I know Fortress came with headless rendering support, but VirtualGL is more powerful: it can also redirect the GUI to use EGL (when combined with Xvfb).
The EGL backend of VirtualGL works by intercepting GLX calls from the application and substituting them with the relevant EGL calls. This translation/faking layer is not 100% feature complete. However, with the error being so uninformative, I have no idea what could be wrong (whether it's on the Gazebo side or the VirtualGL side).
Here is a trace of the VirtualGL function interposer, but I can't make anything useful from it: https://pastebin.com/nsCJ8SSE