gazebosim / gz-sim

Open source robotics simulator. The latest version of Gazebo.
https://gazebosim.org
Apache License 2.0
668 stars 261 forks source link

Support for multiple worlds #18

Open osrf-migration opened 5 years ago

osrf-migration commented 5 years ago

Original report (archived issue) by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).


Description

Since the very beginning we've had support for multiple <world> tags within a single SDF file. The current behaviour is that a simulation runner is started for each of them. However, as the code evolved, that use case became less well supported.

Steps to Reproduce

At the moment, the following crashes with an ODE assertion:

ign-gazebo -f test/worlds/multiple_worlds.sdf

ODE INTERNAL ERROR 1: assertion "!colliders_initialized" failed in dInitColliders() [collision_kernel.cpp:168]

ODE INTERNAL ERROR 1: assertion "!colliders_initialized" failed in dInitColliders() [collision_kernel.cpp:168]

Expected behavior:

At a minimum, no crash. At a maximum, multiple worlds spun up in parallel and running in lockstep, the UI displaying all of them side-by-side.

Actual behavior:

Server crashes, UI freezes.

Reproduces how often:

100% of the time

Versions

ign-gazebo1

Additional Information

I'm not sure if it's best at the moment to:

osrf-migration commented 5 years ago

Original comment by Michael Grey (Bitbucket: mxgrey, GitHub: mxgrey).


This might be something that needs to be addressed in the ignition-physics-dartsim plugin.

Is this happening due to multi-threading, or will it still happen even with only a single thread? If it's a multi-threading race condition issue, then maybe we just need to put a mutex in a sensible place in the ignition-physics-dartsim plugin.

osrf-migration commented 5 years ago

Original comment by Addisu Z. Taddese (Bitbucket: azeey, GitHub: azeey).


I believe this is a race condition on global variables in ODE. I was experimenting with a solution here (914d44a96ac3e097d8d5c9f6ecb92a22b203303e), but ignition-physics-dartsim is probably a more appropriate place to put the mutex.

osrf-migration commented 5 years ago

Original comment by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).


By the way, the same problem happens if we try to spin up 2 secondary servers running physics simulation in the same test, to test distributed simulation.

osrf-migration commented 4 years ago

Original comment by Diego Ferigo (Bitbucket: dgferigo).


Is there any update on this problem? I started testing multi-world simulations and I stumbled upon this problem right away.

If this feature is not yet completed, it would be nice at least having a warning printed when the world file is loaded and multiple worlds are found.

osrf-migration commented 4 years ago

Original comment by Diego Ferigo (Bitbucket: dgferigo).


Thanks @azeey for the fix you proposed. I know it’s a hacky workaround, though it works reliably. Do you see any better solution that does not involve static mutexes?

osrf-migration commented 4 years ago

Original comment by Diego Ferigo (Bitbucket: dgferigo).


Quick update. With fixed-base simple models like pendulum and cartpole the mutex workaround is effective as I reported in my last comment.

However, with complex floating-base models like a humanoid robot with meshes both for visual and collisions, a multiworld setup still segfaults.

I managed to “isolate” the problem, if we can say it. In PhysicsPrivate::Step, locking the same mutex in this scope prevents the segfault. Though, this is not a solution since it means that the real step of the world is not concurrently executed by the threads that handle the different worlds (one for each SimulationRunner). In other words, this mutex prevents stepping the worlds in parallel. I think that parallel execution within a single process is the main feature that triggers downstream users to use a multiworld setup.

Here below the stack trace (text here):

osrf-migration commented 4 years ago

Original comment by Addisu Z. Taddese (Bitbucket: azeey, GitHub: azeey).


Looks like there is a global cache used by the collision detector in ODE. One thing to try would be to build ode with --enable-ou,

osrf-migration commented 4 years ago

Original comment by Diego Ferigo (Bitbucket: dgferigo).


Thanks for the suggestion! From my side recompiling ODE would mean that all the chain ODE → DART → ign-physics should be compiled from sources. In this moment I doubt I can dedicate enough time to set everything up and start debugging the problem (also considering that ODE code is not very straighforward to read) :confused:

traversaro commented 4 months ago

Original comment by Addisu Z. Taddese (Bitbucket: azeey, GitHub: azeey).

Looks like there is a global cache used by the collision detector in ODE. One thing to try would be to build ode with --enable-ou,

We recently experienced this again with @xela-95 in https://github.com/robotology/gz-sim-yarp-plugins/pull/153 . A workaround is to change DART's collision detector not to use ODE . At least on conda-forge, it is relatively easy to set ENABLE_OU option in CMake, I checked and it does not affect the public ABI of ode, while on apt is more complex and can't be done for stable distros. However, I am not sure if this is helpful in the case two servers are stepped by the same thread (unless a new thread is always created by the server, in that case indeed that would solve the problem).

traversaro commented 4 months ago

One thing that we could think of at some point in the future (clearly not in an already released version of Gazebo) is to switch the default collision detector in DART from ODE to something else, I am not sure if ODE was chosen just for historical reasons or indeed the is the best performing collision detector for Gazebo's needs.

azeey commented 4 months ago

I am not sure if ODE was chosen just for historical reasons or indeed the is the best performing collision detector for Gazebo's needs.

If I remember correctly, it was chosen because it worked much better than DART or FCL as collision detectors. I remember FCL converted primitive shapes to meshes, which caused unexpected behaviors. And I think DART as a collision engine is missing some features. It's been a while though, so it might be worth revisiting. Also, bullet might be a viable option, but upstream dartsim needs some changes to make mesh-to-mesh collisions work.

traversaro commented 4 months ago

remember FCL converted primitive shapes to meshes, which caused unexpected behaviors.

That seems to be still the case, see the corresponding DART issue: https://github.com/dartsim/dart/issues/19 (that seems to be blocked by something, but I am not sure which FCL issue, perhaps https://github.com/flexible-collision-library/fcl/issues/106 ? ).

traversaro commented 4 months ago

Yes, indeed looking at bit at issues it seems that each collision detector has some issues, so there is no clearly better DART collision detector, xref: https://github.com/gazebosim/gz-sim/issues/1306 . Also in https://github.com/resibots/robot_dart it seems that the collision detector used changes from example to example.

traversaro commented 2 weeks ago

I investigate a bit the problem as it occurs in https://github.com/robotology/gz-sim-yarp-plugins/issues/201 for some specific test. I was able to fix the crash with error:

ODE INTERNAL ERROR 1: assertion "!colliders_initialized" failed in dInitColliders() [collision_kernel.cpp:168]

with the following DART patch:

diff --git a/dart/collision/ode/OdeCollisionDetector.cpp b/dart/collision/ode/OdeCollisionDetector.cpp
index abfd232..ca45ccc 100644
--- a/dart/collision/ode/OdeCollisionDetector.cpp
+++ b/dart/collision/ode/OdeCollisionDetector.cpp
@@ -49,6 +49,8 @@
 #include "dart/dynamics/SoftMeshShape.hpp"
 #include "dart/dynamics/SphereShape.hpp"

+#include <iostream>
+
 namespace dart {
 namespace collision {

@@ -216,9 +218,14 @@ double OdeCollisionDetector::distance(
 OdeCollisionDetector::OdeCollisionDetector()
 {
   // Initialize ODE. dInitODE is deprecated.
-  const auto initialized = dInitODE2(0);
-  assert(initialized);
-  DART_UNUSED(initialized);
+  static std::mutex odeInitMutex;
+
+  {
+    std::unique_lock<std::mutex> lock(odeInitMutex);
+    const auto initialized = dInitODE2(0);
+    assert(initialized);
+    DART_UNUSED(initialized);
+  }

   dAllocateODEDataForThread(dAllocateMaskAll);

that is probably similar to the gz-sim solution proposed by @azeey in https://github.com/gazebosim/gz-sim/commit/914d44a96ac3e097d8d5c9f6ecb92a22b203303e . This solves the test failures, but those tests use simple models, while apparently this solution still have problems for complex model (like humanoid robots). Probably we can try to extend https://github.com/robotology/gz-sim-yarp-plugins/issues/201 to also include more complex models.

fyi @xela-95

traversaro commented 2 weeks ago

I noticed something unfortunate related to:

ODE INTERNAL ERROR 1: assertion "!colliders_initialized" failed in dInitColliders() [collision_kernel.cpp:168]

flaky errors. I tried to work around it by setting another collision manager, but sometimes the test continue to fail. I then realized that the problem is that even if you set another collider, for each world an instance of OdeCollisionDetector is anyway created in ConstructEmptyWorld (see https://github.com/gazebosim/gz-physics/blob/a068d3f730b8eb2886eff72a8828930489789bc9/dartsim/src/EntityManagementFeatures.cc#L728), that is called by ConstructSDFWorld (https://github.com/gazebosim/gz-physics/blob/a068d3f730b8eb2886eff72a8828930489789bc9/dartsim/src/SDFFeatures.cc#L492). The collision detector is then overwritten in https://github.com/gazebosim/gz-physics/blob/a068d3f730b8eb2886eff72a8828930489789bc9/dartsim/src/WorldFeatures.cc#L41-L74, but this is too late to avoid that dInitODE2 is called two times in two different threads, resulting in the race condition that can result in:

ODE INTERNAL ERROR 1: assertion "!colliders_initialized" failed in dInitColliders() [collision_kernel.cpp:168]