gazebosim / gz-rendering

C++ library designed to provide an abstraction for different rendering engines. It offers unified APIs for creating 3D graphics applications.
https://gazebosim.org
Apache License 2.0
55 stars 51 forks source link

Race condition between Qt and Ogre2 in presentation #304

Closed darksylinc closed 3 years ago

darksylinc commented 3 years ago

ign-rendering uses the same pattern as provided textureinthread sample from Qt. (the code can be seen in repo)

What should happen

The pattern is the following (ideal, i.e. what should happen):

  1. Worker Thread (e.g. Ogre2) is producer
  2. GUI thread (Qt) is consumer
  3. There are two FBOs, A and B
1. Qt requests Ogre2 to start rendering
2. Ogre renders to FBO B, signals it's done to Qt thread
3. Qt thread atomically swaps A and B
4. Qt thread presents B
5. Qt requests Ogre2 to start rendering
6. Ogre renders to FBO A, signals it's done to Qt thread
7. Qt thread atomically swaps A and B
8. Qt thread presents A
9. Repeat step 1

This is basically a double-buffer scheme maintained by hand.

It is even described in Qt's sample (emphasis mine): QtSample

What is actually happening

  1. Qt requests Ogre2 to start rendering
  2. Ogre renders to FBO A, signals it's done to Qt thread
  3. Qt thread presents A
  4. Repeat step 1

This is causing a race condition because A is both being rendered to while also being used internally as a texture by Qt. This can cause all sorts of weird behaviors or maybe even crashes when Qt UI is used.

Most likely how bad this bug is depends on which driver is being used (e.g. Mesa radeonsi vs Mesa Intel, vs Windows Intel, vs proprietary AMD, vs NVIDIA proprietary, vs Nouveau) and the workload balance and processing speed of the CPU (like all race conditions)

Why it went unnoticed

This bug went unnoticed because Gazebo defaults to using 8x MSAA (but due to bugs in setup, MSAA is not actually used, that's another problem) and the FBO passed to Qt is the resolved one. Due to this, the window for the race condition is extremely small (because FBO is only written to during the resolve operation, which is only a fraction of the whole render).

However once I disable MSAA the problem becomes much more apparent, notice the background flicker:

https://user-images.githubusercontent.com/3395130/114423489-9a847c80-9b8d-11eb-9398-8233669468a2.mp4

Environment

Is Ogre1 affected?

I don't know. I did not evaluate. The possibility is high, given that part of the buggy behavior comes from ign-gui, which is agnostic to both Ogre1 and Ogre2

Is OptiX affected?

Unknown.

Tasks

Affected files

Most of the bugs affect the files coming from:

Other

I will be the one fixing this bug. This ticket is for tracking progress

iche033 commented 3 years ago

thanks for uncovering this bug. Note that there is also a Scene3D.cc in ign-gazebo and that's the one currently being used when ign gazebo is run. There is a goal to merge the two Scene3Ds.

darksylinc commented 3 years ago

Ooofff!!

This is going to be one complex merge.

My fixes are still WIP but still I submitted the PR.

I'm mostly interested in hearing feedback about PR https://github.com/ignitionrobotics/ign-gazebo/pull/774 because I know for certain that ign-rendering is currently breaking the ABI and I know how to fix it (I just have to move SwapFromThread to the derived classes and then have ign-gazebo dynamic cast these classes and call them directly)

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

This suggests the classes I touched were never exposed to the user and therefore it is safe to alter them. However I may be wrong.

I'll be waiting on feedback for this. Cheers!

Update: It seems the tests are failing on ign-rendering PR, that was unexpected but I can see why I missed it. I made some local changes to trigger the race condition more aggressively and forgot to test the code without these changes. The module is basically crashing because we're attempting to create the same node definition twice, which should not be happening.

iche033 commented 3 years ago

I know for certain that ign-rendering is currently breaking the ABI and I know how to fix it (I just have to move SwapFromThread to the derived classes and then have ign-gazebo dynamic cast these classes and call them directly)

ign-rendering's ogre2 rendering engine is currently dynamically loaded as a plugin library from ign-gazebo. If a downcast is needed in ign-gazebo, it means we'll need to explicitly link against the ogre2 target, correct? It's doable but that means we'll need to add a few cmake checks and ifdefs to make sure the user has ogre 2.x installed (it's not required since users can use ign-gazebo with just ogre 1.x). I'm leaning towards just merging this into the unstable branch (main branch) in ign-gazebo and ign-rendering for our next release Ignition F. So you can just cherry pick your current changes and apply them to the main branch. We can always backport this change later if you think it's going to help with other bugfixes you're working on.

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

Do you mean ign-gazebo's Scene3D.hh (as opposed to ign-rendering's Scene.hh)?

darksylinc commented 3 years ago

ign-rendering's ogre2 rendering engine is currently dynamically loaded as a plugin library from ign-gazebo. If a downcast is needed in ign-gazebo, it means we'll need to explicitly link against the ogre2 target, correct? It's doable but that means we'll need to add a few cmake checks and ifdefs to make sure the user has ogre 2.x installed (it's not required since users can use ign-gazebo with just ogre 1.x).

Mmm. You are right. It will be troublesome to backport.

To put another way: We just need a way to signal that a particular camera needs to do special work to swap its internal state.

If adding forcing the user to upgrade both ign-gazebo and ign-rendering at the same time is an option, then this can be fixed by adding a new class to ign-rendering which takes care of the downcasting. This will preserve ABI compatibility, but will cause trouble if the user upgrades ign-gazebo without upgrading ign-rendering (which could be detected at runtime because the symbol will be missing, and warn the user to upgrade ign-rendering as well).

I'm talking about something along the lines:

class CameraWorkaround
{
    virtual void SwapFromThread( Camera *camera ) = 0;
}

Is this even feasible? It doesn't have to be a class. It could just be a C function we load dynamically at runtime. Hacky, but it preserves the ABI (as long as it's acceptable that upgrading ign-gazebo forces upgrading ign-rendering too)

I'm leaning towards just merging this into the unstable branch (main branch) in ign-gazebo and ign-rendering for our next release Ignition F. So you can just cherry pick your current changes and apply them to the main branch. We can always backport this change later if you think it's going to help with other bugfixes you're working on.

As for how affects my other bugfixes: The current code is bugged and can corrupt GL state (apparently in non-crashing ways at least with what's we've seen so far with Mesa drivers).

That means that:

  1. Sometimes things workout just fine
  2. Occassionally rendering breaks for a single frame
  3. Sometimes everything is broken

Basically whenever we encounter a bug, it's going to be the question: "is this code broken because I made a mistake, or just the race condition again?". It's similar to "blaming the compiler" when a bug crashes. It's almost never the compiler, except this time you have proof the compiler can occasionally misbehave.

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

Do you mean ign-gazebo's Scene3D.hh (as opposed to ign-rendering's Scene.hh)?

  • ign-gazebo's Scene3D.hh is not installed - you are free to make changes to this file without worrying about API/ABI compatibility
  • ign-rendering's Scene.hh is installed

Yes, sorry. I meant Scene3D.hh. Great that it's not installed, as I suspected

iche033 commented 3 years ago

If adding forcing the user to upgrade both ign-gazebo and ign-rendering at the same time is an option, then this can be fixed by adding a new class to ign-rendering which takes care of the downcasting. This will preserve ABI compatibility, but will cause trouble if the user upgrades ign-gazebo without upgrading ign-rendering (which could be detected at runtime because the symbol will be missing, and warn the user to upgrade ign-rendering as well).

yes the example you gave is feasible. We've actually done something similar before in other libraries for critical bugfixes.

We can also require users to upgrade ign-rendering dependency version when they upgrade ign-gazebo. Currently ign-gazebo and ign-rendering are on version 5.0.0. Once the changes get in, we can bump both versions to 5.1.0 and tell ign-gazebo to look for ign-rednering 5.1.0 by add VERSION 5.1 here.

Just throwing in another idea. Is it possible to do this through ign-gazebo's Scene3D plugin by using 2 cameras that take turn to render? I imagine the difficulty here is keeping the 2 camera pose control (pan, orbit, zoom, etc) in sync.

darksylinc commented 3 years ago

Just throwing in another idea. Is it possible to do this through ign-gazebo's Scene3D plugin by using 2 cameras that take turn to render? I imagine the difficulty here is keeping the 2 camera pose control (pan, orbit, zoom, etc) in sync.

That may not be a bad idea. I'll have to think a few details about on how feasible are other improvements I have to do (basically saving VRAM)

darksylinc commented 3 years ago

Hi!

I've been thinking about this and I realized I missed a simple element, which I couldn't discuss in the middle the weekend and wait for a reply, but NOW is the right time to discuss this:

Do we even need multithreading here?

I assume rendering is currently happening in a different thread because "that's what the sample we found in the Qt wiki did".

There are pros and cons of this approach:

Pros

Cons

Best of both worlds?

I recommend to keep rendering in another thread to avoid the cryptic GL errors, but serializing the operations.

That means the UI thread is blocked until Ogre is done. I've done this successfully in order to test RenderDoc debugging and it worked. By serializing rendering:

It actually gets better, because newer versions at any time can add support for a double buffer scheme on-demand, where at launch-time the user can leverage whether to enable parallel rendering, sacrificing some VRAM for a potential increase in UI responsiveness, or disable parallel rendering.

I was aiming to do parallel rendering by default, but add a command line argument to request serialized rendering for RenderDoc debugging. However it seems that we should do serialized rendering by default, defer parallel rendering for another time (it can be added any time in future versions) and move on to the next problem.

Thoughts?

iche033 commented 3 years ago

I'm also leaning towards doing serialized rendering by default and keeping the code less complicated (I hope). This sounds like the changes will mainly be done in Scene3D class in the way we integrate ogre3d with Qt. And yes we did parallel rendering as that was what worked at the time, so it's good to hear that you were already able to have something working already.

As for parallel rendering, I think the VRAM sacrifice is relatively small compared to some our typical use cases. But I agree we can leave this for the future if we find limitations in serialized rendering, thanks.