gazebo-forks / dart

Dynamic Animation and Robotics Toolkit
http://dartsim.github.io/
BSD 2-Clause "Simplified" License
6 stars 5 forks source link

Fix grouping of contact constraints #6

Closed azeey closed 4 years ago

azeey commented 4 years ago

DART crashes when simulating a modest number of objects that come into contact with each other.

To test, run the following file in ign-gazebo (ign gazebo -v 4 cube5.sdf.txt) cube5.sdf.txt (courtesy of @mjcarroll)

Before the PR, ign-gazebo crashes with

ODE INTERNAL ERROR 1: assertion "bNormalizationResult" failed in _dNormalize4() [odemath.h:42]

Peek 2020-06-09 19-22-crash

With the PR: Peek 2020-06-09 19-23

For a given set of contact constraints, the constraint solver first creates a grouping of related constraints, and then solves each group independently. There seems to be a problem with this grouping and some constraints end up being in two groups.

The ContactConstraint::uniteSkeleton function unites two skeletons associated with a given contact constraint. It does that by setting the mUnionRootSkeleton member of one of the skeletons to point to the current root skeleton of the second skeleton. To decide which of the two skeletons' root node becomes the new union's root node, the size of the skeletons' current unions are compared.

As an example, let A,B,C, and D be skeletons associated with 3 contact constraints. When ContactConstraint::uniteSkeleton is called for constraints 1 and 2, skeleton B is chosen to be the root node.

1. A, B = A -> B
2. C, A = C -> B // Since the root of A is B, C's mUnionRootSkeleton now points to B
3. D, B = B -> D // Let's assume D is chosen because D's union size is larger.

When D is chosen for constraint 3, A and C are not updated. Thus, later on when ContactConstraing::getRootSkeleton is called on constraint 1 or 2, B is returned instead of D. This results in error in the grouping of constraints.

It's not clear to me why this error causes a crash, but the hint that led me to a solution is that I tried disabling the grouping altogether and letting all the constraints be solved in a single call to ConstraingSolver::solveConstrainedGroup and that didn't crash. My guess is that when a constraint ends up in two groups, a solution for that constraint is computed twice and the resulting impulses are erroneously accumulated.

I'm not sure how to properly/reliably test this as it seems to occur randomly.

/cc @scpeters, @mxgrey

nkoenig commented 4 years ago

This also worked for me. My approval is just based on testing the solution.

scpeters commented 4 years ago

I tried disabling the grouping altogether and letting all the constraints be solved in a single call

so is that what this PR is doing? I'm not familiar with the rest of the code, so it's not clear to me

nkoenig commented 4 years ago

I tried disabling the grouping altogether and letting all the constraints be solved in a single call

so is that what this PR is doing? I'm not familiar with the rest of the code, so it's not clear to me

@azeey

scpeters commented 4 years ago

he explained it to me on slack

nkoenig commented 4 years ago

Cool, safe to merge @azeey ?

azeey commented 4 years ago

Yes, I think it's safe to merge. For additional clarity, I instrumented the code to output a dot file and generated graphs of the constraints and their grouping. For the attached SDF file, here is an example grouping for frame 413

constraints413

The ellipses represent the skeletons (cubes and ground_plane) and the edges represent a contact constraint between two skeletons. The light grey edges are constraints where one of the skeletons is a static body. The rectangles represent the constraint groupings. With proper grouping, each node in the graph should only have an edge to another node inside its group (rectangle). The exception is for static/immobile skeletons which are allowed to have constraints across multiple groups because impulses are not applied on them. But notice how model_3_0_3 has a constraint with a model_3_0_4, a skeleton outside its group. I believe in this example, the group labeled model_2_3_4 and model_3_0_0 should have been one group.

In contrast, here is frame 413 after the fix. Granted, the collisions in this frame are going to be different from the one run without the fix, but I wanted to show that this PR does not disable grouping:

constraints413

diegoferigo commented 4 years ago

Great work @azeey and amazing analysis in https://github.com/azeey/dart/pull/6#issuecomment-642201933! We are struggling a lot for a while now with this error and I can't wait to see this change in action. I'm just checking, do the released packages 6.10.0~osrf6~2020-06-10~da7758ae4b7a38 already contain this fix?

I also tag @jrivero here that might know as well.

azeey commented 4 years ago

Thanks @diegoferigo. Yes, the latest debs (6.10.0~osrf6...) should have the fix.

diegoferigo commented 4 years ago

Ok got it, thanks! I just checked the machine where this error occurs more often and I realized that there was the older 6.10.0~osrf1~20190718~fdde7e7894ebc36b version. Thanks for confirming.