VirtualGL / virtualgl

Main VirtualGL repository
https://VirtualGL.org

webots: blinking vgl, some times #172

Closed juju2013 closed 2 years ago

juju2013 commented 3 years ago

Application: webots R2021a (https://www.cyberbotics.com/)
VirtualGL: 2.6.90
TurboVNC: 2.2.6
Container OS: Ubuntu 18.04.5 LTS
Host OS: Centos 7
GPU: Tesla M2090, driver version: 390.116

When webots runs simulations, some run fine, but in others the main display (vgl window) blinks. Screen recorded example: https://video.fbxl.net/videos/watch/3389fa83-c31e-4cc4-a861-805e94b16a59

By trial and error, it looks like this weird behavior is linked to some "robots" (PROTO Nodes/robots), but not all of them: in the same simulation, adding certain robots causes blinking, while removing them and adding other robots lets the simulation run OK.

Here's a trace output of an (almost) empty simulation: http://dl.free.fr/lH4IZIt0F

The same plus a Boston Dynamics dog (screen recorded one): http://dl.free.fr/wl8bbqSMD

Command to launch webots: vglrun +v -d /dev/dri/card1 webots

Docker compose file:

    webots:
      container_name: webots
      shm_size: 2G
      image: webots
      runtime: nvidia
      privileged: true
      environment:
        - NVIDIA_VISIBLE_DEVICES=all
      networks:
        dockernet:
          ipv4_address: 10.99.0.33
      volumes:
        - /home/xxx/webots/:/data/
        - ./init.sh:/init.sh
        - ./xorg.conf:/etc/X11/xorg.conf
      devices:
        - "/dev/dri"
        - "/dev/vga_arbiter"
        - "/dev/nvidia0"
        - "/dev/nvidia1"
        - "/dev/nvidiactl"
        - "/dev/nvidia-modeset"
        - "/dev/nvidia-uvm"
        - "/dev/nvidia-uvm-tools"
        - "/dev/fb0"
      command: /init.sh

dcommander commented 3 years ago

Reproduced. Investigating. This issue is specific to the EGL back end, which is not surprising.

dcommander commented 3 years ago

In the process of investigating this issue, I discovered several other issues with the EGL back end and fixed those, but I have yet to find the cause of this issue. Since I have very limited resources to work on the EGL back end, it may take a while for this issue to get fixed. Please be patient.

dcommander commented 3 years ago

Thanks to piglit, I have discovered and fixed numerous conformance issues in the EGL back end, but this issue still eludes me. Either it is somehow related to rarely-used OpenGL and GLX features that are still missing in the EGL back end (see #134 and #136) or it is yet another conformance issue that hasn't been uncovered yet. Unfortunately I must declare defeat for now.

LeehanLee commented 2 years ago

It seems that the main display (vgl window) blinks only if I launch a robot whose controller includes webots/camera.h and uses the camera-related API.

https://user-images.githubusercontent.com/14244974/167348878-1bcc2b42-ba57-4c61-82d1-f0e56ee8370b.mp4

If I comment out the camera-related code and rebuild the project, the blinking disappears:

https://user-images.githubusercontent.com/14244974/167350065-14a3d3db-5bb2-4ec9-8486-037b9fe2ef72.mp4

Don't know how to solve this issue.

dcommander commented 2 years ago

@LeehanLee That is a good clue. I reproduced the issue, and it is almost certainly a bug in VGL, but I have thus far been unable to find the bug. Now that I know that it is specific to one mode of operation, I can hopefully look at the application source code and figure out what that mode of operation does at the OpenGL level.

dcommander commented 2 years ago

I have spent more hours trying to diagnose this, including comparing the apitrace output with and without cameras enabled. Unfortunately I am still at a loss.

LeehanLee commented 2 years ago

I found the description of the Webots function "wb_camera_enable" here: https://www.cyberbotics.com/doc/reference/camera#wb_camera_enable [screenshot]

As you can see from the above screenshot, I changed the second parameter, "sampling_period", from "2 time_step" to "50 time_step", and the blinking in the rendering area then became less frequent:

https://user-images.githubusercontent.com/14244974/174434355-93698d47-2356-4639-aaf5-244735b3955b.mp4

I'm not sure if this could help you to diagnose this issue.

dcommander commented 2 years ago

That is a good clue. I’ll see if I can find where in the code it copies the image.

dcommander commented 2 years ago

I am now able to build Webots from source and get an OpenGL API trace from it, both with and without cameras enabled. Unfortunately, it hasn't revealed any obvious issues, so I am still clueless. It still isn't clear exactly how enabling cameras changes the OpenGL call sequence. I have tried to add print statements to the code to understand the mechanism by which that happens, but so far it hasn't been revealing. Unfortunately I have to shelve this yet again, as I don't have any more time right now to pursue it.

dcommander commented 2 years ago

Ouch, that was difficult. It ultimately took more than 60 uncompensated hours to diagnose the problem, and I still cannot figure out how to reproduce it in isolation (using fakerut.) However, the following patch seems to fix it:

--- a/server/backend.cpp
+++ b/server/backend.cpp
@@ -73,25 +73,32 @@ static FakePbuffer *getCurrentFakePbuffer(EGLint readdraw)
 void bindFramebuffer(GLenum target, GLuint framebuffer, bool ext)
 {
    #ifdef EGLBACKEND
+   const GLenum *oldDrawBufs = NULL;  GLsizei nDrawBufs = 0;
+   GLenum oldReadBuf = GL_NONE;
+   FakePbuffer *drawpb = NULL, *readpb = NULL;
+
    if(fconfig.egl)
    {
        if(framebuffer == 0)
        {
            if(target == GL_DRAW_FRAMEBUFFER || target == GL_FRAMEBUFFER)
            {
-               FakePbuffer *pb = pbhashegl.find(getCurrentDrawableEGL());
-               if(pb)
+               drawpb = pbhashegl.find(getCurrentDrawableEGL());
+               if(drawpb)
                {
-                   framebuffer = pb->getFBO();
+                   oldDrawBufs =
+                       ctxhashegl.getDrawBuffers(_eglGetCurrentContext(), nDrawBufs);
+                   framebuffer = drawpb->getFBO();
                    ctxhashegl.setDrawFBO(_eglGetCurrentContext(), 0);
                }
            }
            if(target == GL_READ_FRAMEBUFFER || target == GL_FRAMEBUFFER)
            {
-               FakePbuffer *pb = pbhashegl.find(getCurrentReadDrawableEGL());
-               if(pb)
+               readpb = pbhashegl.find(getCurrentReadDrawableEGL());
+               if(readpb)
                {
-                   framebuffer = pb->getFBO();
+                   oldReadBuf = ctxhashegl.getReadBuffer(_eglGetCurrentContext());
+                   framebuffer = readpb->getFBO();
                    ctxhashegl.setReadFBO(_eglGetCurrentContext(), 0);
                }
            }
@@ -107,6 +114,20 @@ void bindFramebuffer(GLenum target, GLuint framebuffer, bool ext)
    #endif
    if(ext) _glBindFramebufferEXT(target, framebuffer);
    else _glBindFramebuffer(target, framebuffer);
+   #ifdef EGLBACKEND
+   if(fconfig.egl)
+   {
+       if(oldDrawBufs)
+       {
+           if(nDrawBufs == 1)
+               drawpb->setDrawBuffer(oldDrawBufs[0], false);
+           else if(nDrawBufs > 0)
+               drawpb->setDrawBuffers(nDrawBufs, oldDrawBufs, false);
+           delete [] oldDrawBufs;
+       }
+       if(oldReadBuf) readpb->setReadBuffer(oldReadBuf, false);
+   }
+   #endif
 }

In a nutshell, the complexity of the Webots camera rendering code exposed a really esoteric aspect of FBO behavior, which is that the draw and read buffer state is attached to the FBO state, but when using the default framebuffer, the draw and read buffer state should be attached to the context instead. I already emulated that behavior in the EGL back end glXMake*Current() functions but didn't realize that I also needed to emulate it in the EGL back end implementation of glBindFramebuffer(..., 0).

Ugh. Did I mention how much simpler this would be if EGL supported a "multiview" Pbuffer extension similar to EGL_EXT_multiview_window? I tried to get nVidia on board with that several years ago, but no dice. Thus, here I am with hundreds of unpaid hours invested in the EGL back end, with barely enough money in the General Fund to maintain it, much less divert project resources for weeks to track down complicated bugs with it.

I am going to do some regression testing tomorrow, and I should be able to push this patch by the end of the day.