darksylinc commented 3 years ago

ign-rendering uses the same pattern as provided textureinthread sample from Qt. (the code can be seen in repo)

What should happen

The pattern is the following (ideal, i.e. what should happen):

Worker Thread (e.g. Ogre2) is producer
GUI thread (Qt) is consumer
There are two FBOs, A and B

1. Qt requests Ogre2 to start rendering
2. Ogre renders to FBO B, signals it's done to Qt thread
3. Qt thread atomically swaps A and B
4. Qt thread presents B
5. Qt requests Ogre2 to start rendering
6. Ogre renders to FBO A, signals it's done to Qt thread
7. Qt thread atomically swaps A and B
8. Qt thread presents A
9. Repeat step 1

This is basically a double-buffer scheme maintained by hand.

It is even described in Qt's sample (emphasis mine): QtSample

What is actually happening

Qt requests Ogre2 to start rendering
Ogre renders to FBO A, signals it's done to Qt thread
Qt thread presents A
Repeat step 1

This is causing a race condition because A is both being rendered to while also being used internally as a texture by Qt. This can cause all sorts of weird behaviors or maybe even crashes when Qt UI is used.

Most likely how bad this bug is depends on which driver is being used (e.g. Mesa radeonsi vs Mesa Intel, vs Windows Intel, vs proprietary AMD, vs NVIDIA proprietary, vs Nouveau) and the workload balance and processing speed of the CPU (like all race conditions)

Why it went unnoticed

This bug went unnoticed because Gazebo defaults to using 8x MSAA (but due to bugs in setup, MSAA is not actually used, that's another problem) and the FBO passed to Qt is the resolved one. Due to this, the window for the race condition is extremely small (because FBO is only written to during the resolve operation, which is only a fraction of the whole render).

However once I disable MSAA the problem becomes much more apparent, notice the background flicker:

https://user-images.githubusercontent.com/3395130/114423489-9a847c80-9b8d-11eb-9398-8233669468a2.mp4

Environment

OS Version: All affected
Source or binary build? Source. Probably all branches where Ogre2 is used

Is Ogre1 affected?

I don't know. I did not evaluate. The possibility is high, given that part of the buggy behavior comes from ign-gui, which is agnostic to both Ogre1 and Ogre2

Is OptiX affected?

Unknown.

Tasks

[x] ~Ensure FBO juggling is thread safe (https://github.com/ignitionrobotics/ign-rendering/pull/307) ~. We're now forcing serialization. See https://github.com/ignitionrobotics/ign-gazebo/pull/774
[x] ~Add "debug mode" for blocking (serializing) Qt and Ogre2 threads for decent RenderDoc debugging (otherwise API calls become interleaved and is extremely confusing)~- We're now forcing serialization.
[x] Fix compositor setup regarding textures. See https://github.com/ignitionrobotics/ign-rendering/pull/313
[x] Fix MSAA usage. See https://github.com/ignitionrobotics/ign-rendering/pull/313
[x] Stretch goal: Add single threaded mode (some API analyzer tools can be picky). Update: This is no longer optional. See https://github.com/ignitionrobotics/ign-gazebo/pull/774

Affected files

Most of the bugs affect the files coming from:

Ogre2RenderTarget.cc
Ogre2Camera.cc (since Ogre2Camera::RenderTextureGLId redirects to Ogre2RenderTarget's GLId)
Scene3D.cc from ign-gui

Other

I will be the one fixing this bug. This ticket is for tracking progress

iche033 commented 3 years ago

thanks for uncovering this bug. Note that there is also a Scene3D.cc in ign-gazebo and that's the one currently being used when ign gazebo is run. There is a goal to merge the two Scene3Ds.

darksylinc commented 3 years ago

Ooofff!!

This is going to be one complex merge.

My fixes are still WIP but still I submitted the PR.

I'm mostly interested in hearing feedback about PR https://github.com/ignitionrobotics/ign-gazebo/pull/774 because I know for certain that ign-rendering is currently breaking the ABI and I know how to fix it (I just have to move SwapFromThread to the derived classes and then have ign-gazebo dynamic cast these classes and call them directly)

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

This suggests the classes I touched were never exposed to the user and therefore it is safe to alter them. However I may be wrong.

I'll be waiting on feedback for this. Cheers!

Update: It seems the tests are failing on ign-rendering PR, that was unexpected but I can see why I missed it. I made some local changes to trigger the race condition more aggressively and forgot to test the code without these changes. The module is basically crashing because we're attempting to create the same node definition twice, which should not be happening.

iche033 commented 3 years ago

I know for certain that ign-rendering is currently breaking the ABI and I know how to fix it (I just have to move SwapFromThread to the derived classes and then have ign-gazebo dynamic cast these classes and call them directly)

ign-rendering's ogre2 rendering engine is currently dynamically loaded as a plugin library from ign-gazebo. If a downcast is needed in ign-gazebo, it means we'll need to explicitly link against the ogre2 target, correct? It's doable but that means we'll need to add a few cmake checks and ifdefs to make sure the user has ogre 2.x installed (it's not required since users can use ign-gazebo with just ogre 1.x). I'm leaning towards just merging this into the unstable branch (main branch) in ign-gazebo and ign-rendering for our next release Ignition F. So you can just cherry pick your current changes and apply them to the main branch. We can always backport this change later if you think it's going to help with other bugfixes you're working on.

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

Do you mean ign-gazebo's Scene3D.hh (as opposed to ign-rendering's Scene.hh)?

ign-gazebo's Scene3D.hh is not installed - you are free to make changes to this file without worrying about API/ABI compatibility
ign-rendering's Scene.hh is installed

darksylinc commented 3 years ago

ign-rendering's ogre2 rendering engine is currently dynamically loaded as a plugin library from ign-gazebo. If a downcast is needed in ign-gazebo, it means we'll need to explicitly link against the ogre2 target, correct? It's doable but that means we'll need to add a few cmake checks and ifdefs to make sure the user has ogre 2.x installed (it's not required since users can use ign-gazebo with just ogre 1.x).

Mmm. You are right. It will be troublesome to backport.

To put another way: We just need a way to signal that a particular camera needs to do special work to swap its internal state.

If adding forcing the user to upgrade both ign-gazebo and ign-rendering at the same time is an option, then this can be fixed by adding a new class to ign-rendering which takes care of the downcasting. This will preserve ABI compatibility, but will cause trouble if the user upgrades ign-gazebo without upgrading ign-rendering (which could be detected at runtime because the symbol will be missing, and warn the user to upgrade ign-rendering as well).

I'm talking about something along the lines:

class CameraWorkaround
{
    virtual void SwapFromThread( Camera *camera ) = 0;
}

Is this even feasible? It doesn't have to be a class. It could just be a C function we load dynamically at runtime. Hacky, but it preserves the ABI (as long as it's acceptable that upgrading ign-gazebo forces upgrading ign-rendering too)

I'm leaning towards just merging this into the unstable branch (main branch) in ign-gazebo and ign-rendering for our next release Ignition F. So you can just cherry pick your current changes and apply them to the main branch. We can always backport this change later if you think it's going to help with other bugfixes you're working on.

As for how affects my other bugfixes: The current code is bugged and can corrupt GL state (apparently in non-crashing ways at least with what's we've seen so far with Mesa drivers).

That means that:

Sometimes things workout just fine
Occassionally rendering breaks for a single frame
Sometimes everything is broken

Basically whenever we encounter a bug, it's going to be the question: "is this code broken because I made a mistake, or just the race condition again?". It's similar to "blaming the compiler" when a bug crashes. It's almost never the compiler, except this time you have proof the compiler can occasionally misbehave.

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

Do you mean ign-gazebo's Scene3D.hh (as opposed to ign-rendering's Scene.hh)?

ign-gazebo's Scene3D.hh is not installed - you are free to make changes to this file without worrying about API/ABI compatibility

ign-rendering's Scene.hh is installed

Yes, sorry. I meant Scene3D.hh. Great that it's not installed, as I suspected

iche033 commented 3 years ago

If adding forcing the user to upgrade both ign-gazebo and ign-rendering at the same time is an option, then this can be fixed by adding a new class to ign-rendering which takes care of the downcasting. This will preserve ABI compatibility, but will cause trouble if the user upgrades ign-gazebo without upgrading ign-rendering (which could be detected at runtime because the symbol will be missing, and warn the user to upgrade ign-rendering as well).

yes the example you gave is feasible. We've actually done something similar before in other libraries for critical bugfixes.

We can also require users to upgrade ign-rendering dependency version when they upgrade ign-gazebo. Currently ign-gazebo and ign-rendering are on version 5.0.0. Once the changes get in, we can bump both versions to 5.1.0 and tell ign-gazebo to look for ign-rednering 5.1.0 by add VERSION 5.1 here.

Just throwing in another idea. Is it possible to do this through ign-gazebo's Scene3D plugin by using 2 cameras that take turn to render? I imagine the difficulty here is keeping the 2 camera pose control (pan, orbit, zoom, etc) in sync.

darksylinc commented 3 years ago

Just throwing in another idea. Is it possible to do this through ign-gazebo's Scene3D plugin by using 2 cameras that take turn to render? I imagine the difficulty here is keeping the 2 camera pose control (pan, orbit, zoom, etc) in sync.

That may not be a bad idea. I'll have to think a few details about on how feasible are other improvements I have to do (basically saving VRAM)

darksylinc commented 3 years ago

Hi!

I've been thinking about this and I realized I missed a simple element, which I couldn't discuss in the middle the weekend and wait for a reply, but NOW is the right time to discuss this:

Do we even need multithreading here?

I assume rendering is currently happening in a different thread because "that's what the sample we found in the Qt wiki did".

There are pros and cons of this approach:

Pros

If CPU bound (but there are cores available), threading increases UI responsiveness because Qt can finish and display the UI while Ogre is still rendering a single frame. So the simulation may display at e.g. 1 FPS while the UI responds immediately.
If GPU bound, threading won't help in UI responsiveness because both threads are still competing for access to the same GPU.
Having a separate GL context in a different thread helps avoiding cryptic OpenGL and/or UI errors (because we must restore GL state and context after we're done, so that Qt can continue normally)

Cons

Parallel rendering increases VRAM consumption, because we need a double buffer scheme to cycle RenderTargets
Increased code complexity
Increased debuggability difficulty
Always-responsive UI is still hard to keep despite being threaded, depending on how things are coded. For example if for every mouse click the UI thread sends a message to the Ogre thread by locking a mutex, the UI thread will become blocked; and things won't feel responsive at all.

Best of both worlds?

I recommend to keep rendering in another thread to avoid the cryptic GL errors, but serializing the operations.

That means the UI thread is blocked until Ogre is done. I've done this successfully in order to test RenderDoc debugging and it worked. By serializing rendering:

Debuggability becomes easier
VRAM consumption doesn't double, because we no longer need a double buffer scheme
No changes are needed in ign-rendering
ABI won't break, as the changes are only needed in ign-gazebo which is not exposed to the user

It actually gets better, because newer versions at any time can add support for a double buffer scheme on-demand, where at launch-time the user can leverage whether to enable parallel rendering, sacrificing some VRAM for a potential increase in UI responsiveness, or disable parallel rendering.

I was aiming to do parallel rendering by default, but add a command line argument to request serialized rendering for RenderDoc debugging. However it seems that we should do serialized rendering by default, defer parallel rendering for another time (it can be added any time in future versions) and move on to the next problem.

Thoughts?

iche033 commented 3 years ago

I'm also leaning towards doing serialized rendering by default and keeping the code less complicated (I hope). This sounds like the changes will mainly be done in Scene3D class in the way we integrate ogre3d with Qt. And yes we did parallel rendering as that was what worked at the time, so it's good to hear that you were already able to have something working already.

As for parallel rendering, I think the VRAM sacrifice is relatively small compared to some our typical use cases. But I agree we can leave this for the future if we find limitations in serialized rendering, thanks.

gazebosim / gz-rendering

Race condition between Qt and Ogre2 in presentation #304

What should happen

What is actually happening

Why it went unnoticed

Environment

Is Ogre1 affected?

Is OptiX affected?

Tasks

Affected files

Other

Do we even need multithreading here?

Pros

Cons

Best of both worlds?