DiscreteUpdateManager::AppendDiscreteContactPairsForPointContact assumes that it owns all of the geometry

RussTedrake commented 2 days ago

What happened?

After updating to the most recent Drake version (from 1.30 to 1.34), the example here crashes with an extremely unhelpful

IndexError: _Map_base::at

After a few hours (!!) of printf debugging through building the python bindings, I've narrowed down the problem: 1) the example builds a diagram with both MultibodyPlant and a QuadrotorGeometry added to the same SceneGraph. 2) in DiscreteUpdateManager::AppendDiscreteContactPairsForPointContact, the raw calls to

    const BodyIndex body_B_index = geometry_id_to_body_index().at(pair.id_B);

throw the error. Basically, that method is tacitly assuming that all of the geometry belongs to the MbP. Fyi @xuchenhan-tri , since your name is on those lines of code and @amcastro-tri to triage.

It would take more work for me to come up with a more minimal repro, and I'm afraid I'm out of time for this morning, but I think the incorrectness of this code is now clear? I also haven't bisected enough to know why this appeared for me switching from 1.30 => 1.34... these lines seem older than that, but presumably something upstream in the code path changed.

Version

(this is reproducible from drake master)

What operating system are you using?

Ubuntu 22.04

What installation option are you using?

pip install drake

Relevant log output

No response

xuchenhan-tri commented 2 days ago

I can take a look at this.

xuchenhan-tri commented 2 days ago

Can you provide your command to reproduce this? I want to pin down the upstream change that caused this.

I tried

python3 -m venv venv
source venv/bin/activate
pip3 install underactuated --extra-index-url https://drake-packages.csail.mit.edu/whl/nightly/
python -m ipykernel install --user --name=underactuated
jupyter notebook underactuated/book/trajopt/gcs_quadrotor.ipynb

and I'm getting a missing dependency:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 8
      5 from pydrake.planning import GcsTrajectoryOptimization
      7 from underactuated import running_as_notebook
----> 8 from underactuated.uav_environment import (
      9     CONVEX_GCS_OPTION,
     10     NONLINEAR_GCS_OPTION,
     11     UavEnvironment,
     12 )

File ~/sandbox/underactuated/venv/lib/python3.10/site-packages/underactuated/uav_environment.py:9
      7 import matplotlib.colors as mcolors
      8 import numpy as np
----> 9 from lxml.etree import Element, ElementTree, SubElement, tostring
     11 # from ngcs.visualization import TraceVisualizer, cmap_ibm
     12 from pydrake.common.value import Value

ModuleNotFoundError: No module named 'lxml'

RussTedrake commented 1 day ago

My intention was that this set of commands would have been sufficient, but you caught a gap in my CI (lxml is a doc only dependency, but it snuck in here as a primary dependency for this example only). I've just pushed new pip wheels for underactuated to fix this; it should work now. (or worst case, you can just pip install lxml in your venv) Thanks.

xuchenhan-tri commented 1 day ago

The test fails with 1.31.0 but not with 1.30.0. Unfortunately, we don't have the nightly wheels from that far back (the nightlies are kept for about 2 months or so), so I don't have a good way to bisect further. (Let me know if anyone has a better way to narrow down the culprit commit).

My best guess so far is Jeremy's GeometryContactData cache (which was introduced in 1.31.0) somehow triggered a previously ignored contact pair to be processed by MbP. In any case, the fix seems to be straightforward. I'll put up a PR this.

My question to @RussTedrake is, is there a good way to test the effect of the fix on underactuated (other than waiting for a new Drake nightly)? BTW, if there's a way, then I can bisect further to find the original Drake commit that triggered the failure.

RussTedrake commented 1 day ago

Wow. Thanks @xuchenhan-tri . I'm not worried about my example, per se. I could work around it for that example if needed. But I think it caught a real problem in the DiscreteUpdateManager (which can be observed just by reading the code). The calls to .at() can throw if the returned geometries are not owned by the current MbP. I would think we should fix that acute problem (and cover it with a test) for general MbP uses?

xuchenhan-tri commented 1 day ago

Yes, I already have a Drake unit test confirming that. Working on a fix now.

If you are content with the scope of the issue being limited to just fixing the Drake issue (which I think makes sense), then I'd be happy to let you confirm the effect on underactuated later on by yourself.

xuchenhan-tri commented 1 day ago

OTOH, I'm slightly uneasy about the fact that we haven't figured out what triggered the failure from 1.30 to 1.34. From the perspective of MbP, the result reported by SG has changed (otherwise, the existing logic, which long predates 1.30, would fail on 1.30 too).

jwnimmer-tri commented 1 day ago

Please leave this alone until Monday when I can look at it.

Whatever the problem is here, it's probably not that MbP "assumes that it owns all of the geometry". Since forever, the plant has required that any geometry that can collide is registered via the plant. Any contacts the plant is fed on its query input port must be limited to only contain geometries registered to the plant, because the plant needs to know which bodies the collision dynamics should affect. We have speculated about ways to map from geometry id -> geometry frame -> MbT frame -> MbT body in lieu of plant-based geometry registration, but that has never been implemented. In short, the assumption about registration is not new.

The calls to .at() can throw if the returned geometries are not owned by the current MbP. I would think we should fix that acute problem (and cover it with a test) for general MbP uses?

It is supposed to throw. We must not silently ignore contacts.

We need to develop an understand of exactly what is happening before we start making changes. I will do that when I next have some time available.

jwnimmer-tri commented 1 day ago

One thing other people could do in the meantime is to improve the exception message text from the confusing map::at exception to instead be a Drake-geneated exception that includes more specifics.

RussTedrake commented 1 day ago

I agree that the method very likely should throw. What I said is that we should not throw by calling at() badly; ie that we should improve the exception message.

jwnimmer-tri commented 1 day ago

I got the impression that @xuchenhan-tri was proposing a change to the logic, rather than the message. If the only change in flight is to the message, then please go ahead!

xuchenhan-tri commented 1 day ago

Any contacts the plant is fed on its query input port must be limited to only contain geometries registered to the plant.

That is not the case at the moment. QueryObject computes all contact pairs among geometries that have proximity roles, and MbP just consumes as if all geometries are from the plant.

All I've done so far is reproducing the failure in Drake. (https://github.com/xuchenhan-tri/drake/tree/non-mbp-geometry-id)

I can morph that into a fix on the error message, PR, and then hand the issue to Jeremy.

jwnimmer-tri commented 1 day ago

Any contacts the plant is fed on its query input port must be limited to only contain geometries registered to the plant.

That is not the case at the moment. QueryObject computes all contact pairs among geometries that have proximity roles, and MbP just consumes as if all geometries are from the plant.

Right. Adding geometries that have a proximity role without going through MbP's is an illegal use of the plant, but perhaps was not flagged previously (and in any case needs a better error message). My modal hypothesis of the root cause here is that the notebook is adding the proximity role to what's supposed to be illustration-only animation geometry.

jwnimmer-tri commented 1 day ago

Bisecting points to the output port sampling (#21623) as the first failing commit.

Let me know if anyone has a better way to narrow down the culprit commit.

Here's the script I used to vet a Drake revision during the bisect.

#/bin/bash

set -exuo pipefail

cd /home/jwnimmer/tmp
# python3 -v venv env
# env/bin/pip3 install underactuated
# env/bin/pip3 uninstall drake

rm -rf /home/jwnimmer/tmp/install
(cd /home/jwnimmer/jwnimmer-tri/drake && bazel run //:install -- /home/jwnimmer/tmp/install)

env PYTHONPATH=$(pwd)/install/lib/python3.10/site-packages \
    env/bin/python3 gcs_quadrotor.py

For the python runtime environment, it uses underactuated from PyPI but drake from a source install.

xuchenhan-tri commented 1 day ago

Adding geometries that have a proximity role without going through MbP's is an illegal use of the plant.

In Russ's case, the geometry with proximity role is added through MbP, just not the one used in the Simulator.

Thanks for the script. I forgot about installing locally!

jwnimmer-tri commented 1 day ago

Probably I should also clarify for the record that installing from source isn't exactly the same as installing from wheel. If the bisect in question might be sensitive to the build flavor, then something like bazel run //tools/wheel:builder -- --python=310 0.0.0 would be the command we use to make a wheel from the current git sources. However, that method has higher overhead (slow), so I went with a plain source build this time since the error seemed unlikely to hinge on the build flavor.

jwnimmer-tri commented 13 hours ago

The simulator crashes a little bit after the 0.5s mark. Trying to get to the root cause, I edited underactuated's QuadrotorGeometry code so that the collision box in its quadrotor.urdf had its proximity role removed, and instead an illustration role added (as a red primitive). That allows the animation to run to completion. Here is what it looks like this during trajectory replay, near the point where the map::at would throw:

So it seems like the collision box might have been clipping the window frame opening? Let's turn off illustration geometry and look at the proximity geometry only:

That looks like the window opening in the proximity geometry is on wrong side. I believe the wall model files have some bug(s).

I am still working to understand exactly what is new during output port sampling that's coming into play here, and why it was not a problem previously.

jwnimmer-tri commented 13 hours ago

Aha, I found the piece I was missing. This example is using TAMSI for its discrete updates, which has an early exit when there are no dofs:

  // Quick exit if there are no moving objects.
  if (nv == 0) return;

Even if there are contacts on the input port, and even if the input contacts are nonsense, TAMSI doesn't try to solve for them. Only later in the discrete update (when the discrete step memory tries to pre-compute the contact results output port) does the plant notice that the input contacts are nonsense. SAP does not have an early-exit, which had me stuck for a while trying to figure out why the solve wasn't crashing but the port memory was.

I am not really keen to try to fix anything more with TAMSI these days, so I think the only change we need here is fixing the error message to be more clear.

(We should also finish up changing our default solver from TAMSI to SAP, but that is already tracked elsewhere.)

jwnimmer-tri commented 13 hours ago

To get underactuated working again, either:

(1) the QuadrotorGeometry class should remove (or avoid adding) the proximity role to its geometry, or

(2) the QuadrotorGeometry (or is caller) should configure SceneGraph to filter out collisions between the quadrotor and anything/everything else.

RobotLocomotion / drake