TAMSI results cache hack doesn't work correctly when caching disabled

sherm1 commented 4 years ago

@siyuanfeng-tri ran into a problem in a complicated Diagram where the behavior differed with caching enabled or disabled. I was able to trace this to the TAMSI results cache -- if I disabled caching for just that cache entry, leaving all other caching enabled, everything worked fine. In fact, I could re-enable even that cache entry as long as I did it after an AdvanceTo(0), which serves only to process the time-0 unrestricted and discrete updates.

There is a known-hack for this particular cache entry, well-documented in the code here. TL;DR: the cache entry declares dependency only on discrete state xd but in fact depends on much more.

Tracing in the debugger and discussing with @amcastro-tri , this is what is happening:

Initially, one or more MBPlant input ports (actuation e.g.) is presenting a NaN value.
During initialization, TAMSI results are evaluated (during the publish phase). They are needed to publish contact results. That marks the TAMSI cache entry "up to date", but calculated with the bad initial value.
There is a time-0 periodic update in another subsystem that updates its discrete and abstract variables.
That update causes that subsystem to change its outputs, which feed back into MBPlant eventually, meaning one of MBPlant's input ports changed value (from NaN to non-NaN in this case).
The Simulator now needs accelerations at time 0. Those depend on TAMSI results which should have been sampled and held at zero. With caching ON, no MBPlant discrete variables have changed, so TAMSI result cache is still marked up to date as it should be.
The accelerations come out NaN since the results weren't updated, causing the problem Siyuan encountered. That is the correct framework behavior; the fix is to make sure the initialization produces non-NaN values (e.g. using an initialization event).
But with caching OFF, the TAMSI results are recalculated and see the modified input port values, causing a different outcome.

This violates the caching Prime Directive -- the answers should be unchanged regardless of whether caching is enabled.

sherm1 commented 4 years ago

After much discussion, Alejandro and I came to the conclusion that the current behavior with caching ON is correct, but with caching OFF the behavior is wrong (though in this case it seemed better). There are two independent problems here:

Siyuan's system should have produced a good value after the Simulator::Initialize() call but actually produced NaNs until the first discrete update at the start of the first step. In this case an initialization event could be used to properly initialize the system. (I think Siyuan already made this change as a workaround, but actually that is the correct fix.)
MultibodyPlant in discrete mode should depend on sampled t[n] and input ports u[n], rather than continuous ones t, u(t). The hack mentioned above uses the caching system to achieve that effect. With caching turned off, the MBPlant results now depend on u(t) and thus picked up the later non-NaN values rather than the initial values which it should have sampled.

To say this another way, when running MBPlant in discrete mode we want its outputs y to be discrete with governing equations

y[n+1] = y(t[n], x[n], u[n])

which is what we get currently with caching on. (Note that in Siyuan's example u[0]=NaN, hence the unwanted behavior.) But with caching off, we are getting

y[n+1] = y(t, x[n], u(t))

which is an undesirable hybrid (mixed discrete/continuous) output. (In the example u(t) had been fixed up to be non-NaN so the behavior appeared to be better. )

We are debating how to address this and welcome any ideas. The most obvious fix is that MBPlant in discrete mode should sample its inputs into internal discrete variables, then have the TAMSI calculation depend on those variables rather than directly on the input ports. That would have the same effect as adding a zero order hold in front of the inputs. That would produce the desirable effect that the system would give the same results with caching on or off, which seems like an important invariant.

Thoughts?

sherm1 commented 4 years ago

@siyuanfeng-tri here is the fix I suggest:

MotionPrimitive::MotionPrimitive(const multibody::MultibodyPlant<double>* plant)
    : plant_(*plant) {

  // These are being directly copied to output ports.
  x_command_index_ = this->DeclareDiscreteState(
      systems::BasicVector<double>(plant->num_multibody_states()));
  motion_summary_index_ = this->DeclareAbstractState(
      AbstractValue::Make<MotionSummary>(MotionSummary()));

  //     ... declare ports ...

  // Ensure above state variables are valid at the start of the
  // first step (i.e., after Simulator::Initialize()).
  this->DeclareInitializationUnrestrictedUpdateEvent(
      &MotionPrimitive::UpdateEverything);

  // Then update these variables periodically.
  this->DeclarePeriodicUnrestrictedUpdateEvent(1e-3, 0,
      &MotionPrimitive::UpdateEverything);
}

systems::EventStatus MotionPrimitive::UpdateEverything(
    const systems::Context<double>& context,
    systems::State<double>* state) const {
  UpdateLogic(context, state);
  UpdateCommand(context, state);
  UpdateMotionSummary(context, state);
  drake::log()->info("MotionPrimitive::UpdateEverything");
  return systems::EventStatus::Succeeded();
}

Note: in addition to the initialization event (which was the fix), I switched from the Old School "override the dispatcher" style to what all the Cool Kids do now -- declare their own event handlers. Here UpdateEverything() is just a private method whereas the DoUnrestrictedUpdate() dispatcher was an overridden virtual.

sherm1 commented 4 years ago

I edited the title and description to reflect the investigation. This example revealed several independent issues that need to be addressed:

MBPlant doesn’t work right with caching OFF (this issue).
The integrators should check state derivatives for NaN and complain (at least in Debug). That would have exposed the problem earlier I think.
I had to add a new debugging feature to the caching system to track this down (disable just one cache element at a time to see which one is broken). That's really useful and ought to be available for anyone to use.
I’m not sure how to make the need for initialization events more visible (or automatic). That is a very obscure requirement but logically necessary for discrete systems to function correctly.

I'll follow up on these.

siyuanfeng-tri commented 4 years ago

Thanks Sherm! I should have remembered the initialization events.. I probably wrote those, or at least pushed really hard for having those in the early days.... I thought MBP's mode (discrete / continuous) is independent of what else in in the diagram, is this not true?? Is the discrete mode you are talking about in the second post coming from TAMSI hack? I am a bit confused here.

amcastro-tri commented 4 years ago

You are right @siyuanfeng-tri, whether MBP is discrete/continuous is given at construction, independent of anything else in the parent diagram. As @sherm1 mentions above, what we'd like to model is y[n+1] = y(t[n], x[n], u[n]). Notice the very important subtlety here, the discrete input port (not continuous!) u[n]. We didn't understand this very well before, and when I implemented the contact results output port in discrete mode I "emulated" the discrete behavior of u[n] entirely motivated on a run-time performance fix. See more details in the in source comment here. My "hack" was to make the contact results port (which in turn yes, it depends on the TAMSI cache entry) dependent on the discrete state xd. Therefore this discrete input u[n] "emulation" only works with cache turned on. Let me see if I can say it another way (sry, this is difficult to explain), the TAMSI cache entry with cache turned on, emulates a new discrete state (as if instead of having a ZOH for u[n], which would entail a new discrete state for u, we have a new state for the TAMSI computation). When caching is off, this "emulated" TAMSI state is gone and now instead you get the new (hybrid) system y[n+1] = y(t, x[n], u(t)) (notice t, u(t) instead of t[n], u[n]!).

Hopefully that helps?

amcastro-tri commented 4 years ago

I realize that the comment in the "hack" should instead say S = S(t[n], x[n], u[n]). We do have x[n], our MBP discrete state. We'd need a new state for u[n]. I believe that should solve the problem on the MBP side.

siyuanfeng-tri commented 4 years ago

makes sense.

jwnimmer-tri commented 2 days ago

Fixed in #21623, as I understand it.

RobotLocomotion / drake

TAMSI results cache hack doesn't work correctly when caching disabled #12643