RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.27k stars 1.26k forks source link

Copying state from a Context for one MultibodyPlant-SceneGraph (MBP+SG) diagram to the State for a Context for another MBP+SG diagram can yield an exception #12563

Closed edrumwri closed 2 years ago

edrumwri commented 4 years ago

The exception that results when calling certain geometry-related methods on MBP (e.g., CalcContactResults()) after copying the state in this way does not yield any insight into the problem.

This seems to be a result of SceneGraph storing GeometryIds in the Context and MBP storing a mapping from BodyIDs to GeometryIds in itself (the LeafSystem).

cc @amcastro-tri @sherm1 @jwnimmer-tri

edrumwri commented 4 years ago

Argh. Was just bitten by this one again. I used Context::SetTimeStateAndParameters() to set the state for one MBP+SG to that of another one constructed identically. No indications of any problems but then a later call to MBP::CalcContactResults() with this "updated" context caused a std::map::at() exception.

SeanCurtis-TRI commented 4 years ago

Did you get any information beyond that it was a std::map:;at()? No location at all?

edrumwri commented 4 years ago

Sure- I didn't think it was that helpful, but here you go:

https://github.com/RobotLocomotion/drake/blob/ab093efb84c7da386d24677f43f8867fad3b9a72/multibody/plant/multibody_plant.cc#L1124

Anything else you'd like to know?

jwnimmer-tri commented 4 years ago

This sounds like a duplicate of #9501 and #12560 to me.

9501 because the developer's mental model is that we've created two functionally identical Diagrams that have only a pile of doubles as state (xc, xd), and therefore we should be able to copy the context's state from one context to another context without a problem because the state is trivial, but in reality there is some non-state ancillary data being stored as abstract state (SceneGraph's xa) that is incompatible with the copy.

12560 because SetTimeAndStateAndParametersFrom should probably only be allowed between Contexts created by the same System instance. As shown in the crash here, mixing Contexts with Systems that did not create them is brittle in general.

If we had a fail-fast during the SetTimeAndStateAndParametersFrom, then the developer could know to just copy over the piles of doubles, and then everyone is happy. (At least, insofar as the non-double state was in fact irrelevant.)

edrumwri commented 4 years ago

I agree. I was actually going to file this under #12560 but this issue came up first in my search and was directly relevant.

But. I also think that one should be able to copy state from one system to another if they were constructed using the same sequence of steps. In the MBP+SG case, we know that we can do this by copying only the MBP kinematic variables. But per #9501, that's undocumented behavior (and which is subject to change). If SetTimeAndStateAndParameters is only allowed between Contexts created by the same System instance, as agreed, then what would be the supported, non-brittle way to copy state from two identically constructed systems?

jwnimmer-tri commented 4 years ago

It might not be possible to safely copy abstract state (or parameters) between foreign contexts, in general. I will have to think about it some more.

To help that thought process, can we think of any examples (other than GeometryState) of wanting to copy xa or pa between identically-constructed systems? Grepping for DeclareAbstractState, perhaps something like the "most recently received LCM bytes" from LcmSubscriberSystem would come up in practice.

Perhaps what we lack is a way to label xa and pa as being tied to the specific system instance, as part of DeclareAbstractState? Then SetTimeAndStateAndParametersFrom could use as the guard for whether copying them is sensible.

Another way to do that would be to have Systems that use instance-tied xa and pa to imbue those abstract values with an instance-specific key, and fail-fast if asked to compute with the wrong key. At least that would be better than a segfault, though it defers failures to Calc time instead of SetFrom time.

sherm1 commented 4 years ago

The problem here is very specific to SG (currently). The global nature of GeometryIds causes them to differ between two identically-constructed systems, which is what makes GeometryState non-transferable. That's analogous to using pointers rather than indexes and prevents copying. I'm inclined to call that a design flaw and suggest that modelers think about designing abstract state variables for copyability. Might be easier said than done though.

edrumwri commented 4 years ago

I've been a little worried about these sorts of problems lately. Along this train of thought (perhaps another issue is in order), I've written a fair bit of code that has assumed that the MBP state is fully set through SetPositionsAndVelocities(.) and I'm sure I'm not the only one. If an auxiliary continuous variable is added to MBP, for example, lots of code is going to silently break. And if such a change will never be made, I've written lots of needless assertions to guard against it.

SetPositionsAndVelocities(.) is counter to the way that we operate on state in the rest of the API, which is not good, but it's behavior is clearer than the existing API methods since you understand exactly what it is you're manipulating.

SeanCurtis-TRI commented 4 years ago

The line you cited is very telling. You got collision for a geometry that MBP doesn't know anything about. Possibilities:

  1. Not every geometry in your scene is owned by MBP or
  2. The two contexts are not as identical as you think. You should be able to confirm that using SceneGraphInspector::GetAllGeometryIds().
SeanCurtis-TRI commented 4 years ago

Never mind....my response is redundant.

I agree, I really need to remove MPB's private knowledge of geometries. It should ask SG for everything it needs to know about geometries.

However, there may still be a problem based on FrameIds. MBP must remain a mapping between bodies and geometry::FrameId. I'm not familiar enough with your workflow to anticipate whether or not that would still get you.

edrumwri commented 4 years ago

However, there may still be a problem based on FrameIds. MBP must remain a mapping between bodies and geometry::FrameId. I'm not familiar enough with your workflow to anticipate whether or not that would still get you.

A sensible error message for all related issues would be much better than the status quo. I mentally recorded the "don't use a different Context with MBP" landmine only to hit it again under different circumstances. A grep indicates that there are only a handful of such at(.) calls that could blow up in MBP (though perhaps there are other methods/assumptions that could slip by), so a little bit of help for the user seems not too hard.

jwnimmer-tri commented 4 years ago

... suggest that modelers think about designing abstract state variables for copyability. Might be easier said than done though.

Since abstract state is our only relief valve for idioms not representable in the other more restricted kinds of state, I'm hesitant to impose the additional burden on it of being independent of the system that populated it. I would be happier if we could instead just fail fast(er) in cases where things break, either with more kinds of framework labels on the state, or perhaps just documenting that dependent state should always check for system compatibility before being used.

... A sensible error message for all related issues would be much better than the status quo ...

With a few internal visibility changes / new getters between SceneGraph and GeometryState, it seems to me like the SceneGraph constructor could save its state_value->self_source() key in a class member, and then in the geometry_state(const Context<T>&) method could check the GeometryState::self_source() is consistent with the saved key, and throw if not. Similarly it could pass that key to the QueryObject to do the same check. That would prevent any SceneGraph-related operation from operating on a context value that it did not create.

SeanCurtis-TRI commented 4 years ago

geometry_state(const Context&) method could check the GeometryState::self_source() is consistent with the saved key, and throw if not.

That's very reasonable.

However, there's an aspect of this whole scenario that confuses me. That suggests it's not so simple. This whole discussion arises from the belief that the cause of this is that constructing two topologically MBP+SG systems fails because their ids don't match. The evidence is that when an MBP learns about a collision it looks up the reported GeometryId in its own cached database in an unsafe manner and explodes because it's not there.

However, well before that happens, the MBP had to provide poses to SG as a prerequisite to the collision query. That will throw if the MBP's FrameIds don't match SG's knowledge of FrameIds. But if that's not throwing (with an intelligible message) then somehow MBP and SG are perfectly able to communicate lucidly about source ids, and frame ids (which have the same global properties as GeometryIds) but somehow it breaks down at the GeometryId level? I just don't see how that would happen.

I infer from the post from 16 hours ago that the contexts were generated by the two systems. Such that each context was appropriate to the system. That means there should be 100% coherency in the ids in each context locally.

I'll continue pondering to see if I can't think of something. But as I consider the problem, I don't think the problem is correctly characterized yet.

jwnimmer-tri commented 4 years ago

For sure, we should have failing code posted (linked) to this issue before we start working it. Hopefully @edrumwri could provide a simple example, or failing that we could try to repro it by making our own toy example.

edrumwri commented 4 years ago

I'll create a toy example within the next week.

jwnimmer-tri commented 3 years ago

Given the progress on #9501 and #12560, I wonder if there is anything left to do here.

rpoyner-tri commented 2 years ago

Given that this issue has been silent for 6 months with Jeremy's question, I'm going to just close it.