RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Cache intermediate outputs within a Diagram #4127

Closed david-german-tri closed 6 years ago

david-german-tri commented 7 years ago

Problem Statement

Since #3455, Systems within a Diagram "pull" on their input ports instead of reading a precomputed value. As a result, upstream outputs are frequently recomputed with the exact same Context. This is the most pressing System2 framework performance issue right now.

Background

We would like to compute an upstream output only if that computation will produce a different value than the one we already have. A given System's Context can keep track of whether any time, input, state, or parameter values have actually changed. However, even if none of those things have changed, we cannot use a previously computed output with confidence, because it is possible that a state change in an upstream system would cause that system to produce a different output, eventually affecting the input of the System in question.

Detailed Design

To resolve this, whenever any change occurs in Context A that might affect an output of System A, each Context that depends on Context A must be notified that its input is no longer valid. That Context, in turn, must determine whether its own outputs are invalid as a result, and if so, continue recursively propagating the invalidation signal.

As described, this algorithm still scales poorly: if every state in the Diagram is invalidated simultaneously, as typically happens in a simulation timestep, it will produce O(N^2) notifications in the number of Systems, when only O(N) notifications are really necessary. To resolve this, @sherm1 has suggested giving every invalidation event a unique identifier, which is recorded alongside the invalidated quantity. Whenever the recursive invalidation sweep hits a quantity that bears the current identifier, it will stop. Thus, no quantity is ever invalidated twice in the same event, and complexity is bounded above by O(N).

The existing Cache infrastructure, once extended with invalidation IDs, has the right structure for this problem. Each leaf System will have a cache entry for each Context field, and for each output port. By default, the output ports will depend on the entire Context. Accessing a mutable pointer to a leaf Context field will trigger invalidation of all its dependents, generating a new unique ID for the event. Similarly, accessing a mutable pointer to Diagram state will trigger invalidation on the corresponding state component in every subsystem, passing in a shared unique ID to each.

If, as the result of a leaf invalidation event, any of the cache entries corresponding to that System's output ports are newly invalidated, propagation occurs. Propagation will be implemented by asking the parent (Diagram) Context to invalidate inputs on the downstream Contexts.
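
To make the ID-guarded sweep concrete, here is a minimal standalone sketch (names such as `CacheEntry` and `Invalidate` are invented for illustration; this is not the actual Cache/Context API): each entry records the serial number of the last event that invalidated it, so the sweep stops at any entry already stamped with the current serial, and a diamond-shaped dependency is visited only once per event.

```cpp
// Standalone illustration of event-ID-guarded invalidation.  The names here
// are made up for this sketch; they are not Drake's Cache/Context API.
#include <cstdint>
#include <iostream>
#include <vector>

struct CacheEntry {
  bool is_valid{false};
  int64_t last_invalidation_event{-1};  // Serial of the last event that hit us.
  std::vector<CacheEntry*> dependents;  // Entries computed from this one.
};

// Recursively invalidate `entry` and everything downstream of it, stopping at
// any entry already stamped with this event's serial number.  Each entry is
// visited at most once per event, so total work is O(N).
void Invalidate(CacheEntry* entry, int64_t event_serial) {
  if (entry->last_invalidation_event == event_serial) return;  // Already swept.
  entry->is_valid = false;
  entry->last_invalidation_event = event_serial;
  for (CacheEntry* dependent : entry->dependents) {
    Invalidate(dependent, event_serial);
  }
}

int main() {
  // A tiny diamond: state -> {output_a, output_b} -> downstream_input.
  CacheEntry state, output_a, output_b, downstream_input;
  state.dependents = {&output_a, &output_b};
  output_a.dependents = {&downstream_input};
  output_b.dependents = {&downstream_input};
  for (CacheEntry* entry : {&state, &output_a, &output_b, &downstream_input}) {
    entry->is_valid = true;
  }

  // One mutable access to the state generates one new event serial; the shared
  // serial prevents downstream_input from being swept twice in this event.
  int64_t next_event_serial = 0;
  Invalidate(&state, next_event_serial++);
  std::cout << "downstream_input valid? " << downstream_input.is_valid << "\n";
  return 0;
}
```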

Work Plan

The design leads to the following list of tasks. I already have local prototypes of some of these tasks; there are about 60 more hours of work to do to make it all production-ready.

Alternative Considered: Revert #3455

In contrast with the above, the pre-#3455 push invariant allowed Systems and Contexts to assume their inputs were already up to date. This invariant was achieved by traversing the Diagram to evaluate outputs in sorted order. While this approach created a theoretical risk of computing some outputs unnecessarily due to branching in downstream Systems, it also offered a trivial O(N) upper bound on any computation, and a straightforward way to skip computations based on sparsity annotation. Moreover, the simple locality guarantee would have enabled caching outputs by inspection of the leaf Context, with no propagation bookkeeping at all, reducing the linear coefficient to near-zero when all outputs are fresh.
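
For comparison, here is a tiny sketch of that push-style sweep (illustrative names only, not the pre-#3455 System 2 API): systems are evaluated once per step in topologically sorted order, so every input is already up to date when it is read.

```cpp
// Tiny sketch of a "push" evaluation order: outputs are computed once per
// step in sorted (upstream-first) order.  Names are invented for this sketch.
#include <functional>
#include <iostream>
#include <vector>

struct PushSystem {
  std::function<double(double)> calc_output;  // Maps this system's input to its output.
  double input{0.0};
  double output{0.0};
  std::vector<PushSystem*> downstream;        // Systems fed by our output port.
};

// One push sweep: every output is computed exactly once, an O(N) bound per
// step, and each downstream input is written before it is ever read.
void EvaluateInSortedOrder(const std::vector<PushSystem*>& sorted_systems) {
  for (PushSystem* system : sorted_systems) {
    system->output = system->calc_output(system->input);
    for (PushSystem* sink : system->downstream) sink->input = system->output;
  }
}

int main() {
  PushSystem source, gain;
  source.calc_output = [](double) { return 3.0; };
  gain.calc_output = [](double u) { return 2.0 * u; };
  source.downstream = {&gain};
  EvaluateInSortedOrder({&source, &gain});
  std::cout << gain.output << "\n";  // 6
  return 0;
}
```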

#3455 could be reverted with no impact on System authors. It's easy to see this because #3455 itself essentially did not change any of the systems that existed at the time.

The one argument against this approach is the same argument that prevailed in the run-up to #3455: it might compute outputs wastefully, if a downstream System notionally depends on an output that it doesn't actually read. The pull architecture offers a stronger guarantee that an output is only computed if it is actually used. Since that concern remains as valid (or not) as it was back then, I don't think a revert is in the cards.

siyuanfeng-tri commented 7 years ago

Hey David, what about LCM subscribers? I see two problems with them: they don't currently have a notion of when a new message arrives, and they are threaded. The first one is probably not that hard to handle. I am not sure what's the best way to deal with the threading part. What happens when you get a new message halfway through evaluation? It's probably not the best idea to throw everything away and start over again.

I actually like push better than pull just because it's easier, and it's sort of a user fault to have unconnected output ports that require a large amount of computation.

sherm1 commented 7 years ago

@siyuanfeng-tri, we have to think about message arrival in the larger context of a time-advancing simulation, where they appear as randomly-timed events. The timing has to be discretized properly so that the message's influence occurs between, rather than during, a continuous integration interval. That can be done by sampling or via the simulator's event handling mechanism, but should never be allowed to modify an evaluation-in-progress.

SeanCurtis-TRI commented 7 years ago

The problem statement doesn't make it clear why "ids" are specifically necessary. Why not just a dirty bit? This id mechanism suggests that a system whose state is invalidated by one event, and never updates its values, could be invalidated again by another event, even though it was already marked as "dirty".

siyuanfeng-tri commented 7 years ago

@sherm1 I agree with you that there should be some atomic access per simulation / control tick, but I don't think that's the case at the moment. Independent of simulation, the controller is triggered essentially whenever there is a new incoming state (msg), and I am not sure how that interacts with the event handling (I am not running a simulator, but I suppose I could).

david-german-tri commented 7 years ago

> The problem statement doesn't make it clear why "ids" are specifically necessary. Why not just a dirty bit? This id mechanism suggests that a system whose state is invalidated by one event, and never updates its values, could be invalidated again by another event, even though it was already marked as "dirty".

@SeanCurtis-TRI Yes, exactly. Abstracting away inputs and outputs for the moment, suppose cache value C depends on cache value B, which depends on cache value A, and all three cache values are valid. Then the following sequence of events occurs:

  1. The value of A changes (invalidation event 1), so B becomes invalid. Because B becomes invalid, C becomes invalid.
  2. The value of C is recomputed, but due to an if statement on some other variable, B doesn't have to be recomputed right this instant. Thus, B is still invalid.
  3. The value of A changes again (invalidation event 2), so B becomes invalid. If we only had a dirty bit, we would stop here, but that would incorrectly leave C valid! The ID allows us to detect that B was last invalidated on event 1, and this is event 2, and therefore B becomes invalid (again), and therefore C becomes invalid.
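
Here is the same sequence as a tiny standalone sketch (plain C++ with invented names, not Drake code); the comment in `Invalidate` shows exactly where a single dirty bit would go wrong:

```cpp
// Illustration of the A -> B -> C sequence above; the names are invented for
// this sketch and are not Drake's API.
#include <cstdint>
#include <iostream>
#include <vector>

struct Entry {
  bool valid{true};
  int64_t last_event{-1};
  std::vector<Entry*> dependents;
};

void Invalidate(Entry* entry, int64_t event) {
  // With only a dirty bit the guard would be `if (!entry->valid) return;`,
  // and step 3 below would stop at B, incorrectly leaving C marked valid.
  if (entry->last_event == event) return;
  entry->valid = false;
  entry->last_event = event;
  for (Entry* dependent : entry->dependents) Invalidate(dependent, event);
}

int main() {
  Entry A, B, C;
  A.dependents = {&B};
  B.dependents = {&C};

  Invalidate(&A, 1);  // Step 1: A changes (event 1); B and C become invalid.
  C.valid = true;     // Step 2: C is recomputed without needing B.
  Invalidate(&A, 2);  // Step 3: A changes again (event 2); sweeps through B.
  std::cout << "C valid? " << std::boolalpha << C.valid << "\n";  // false
  return 0;
}
```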
david-german-tri commented 7 years ago

@siyuanfeng-tri Wow, thanks for calling that out. I hadn't considered out-of-thread invalidation events, and LCM subscribers are a perfect motivating example. I think you and @sherm1 are right that those invalidations need to be synced to the timestep; event handling seems like a reasonable way to do it. Basically:

  1. LCM message arrives and gets written to hidden state out-of-thread.
  2. The next time CalcNextUpdateTime is called, the system requests an update at the next tick of some fixed rate.
  3. In the update, the hidden state gets copied to overt state, leading to invalidation.
  4. The overt state determines the output.
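
For illustration, here is a rough sketch of steps 1-3 (plain C++; names like `MessageMailbox` and `HandlePeriodicUpdate` are hypothetical, and this is not the actual LCM subscriber or System API): the receive thread only ever writes into a mutex-guarded mailbox, and the copy into overt state happens inside the scheduled update, so to the rest of the diagram it looks like an ordinary state change.

```cpp
// Sketch of the mailbox-then-periodic-update idea; all names are hypothetical
// and this is not Drake's LCM or System API.
#include <iostream>
#include <mutex>
#include <optional>
#include <string>

// Step 1: the LCM receive thread posts messages here at arbitrary times.
class MessageMailbox {
 public:
  void Post(const std::string& message) {  // Called out-of-thread.
    std::lock_guard<std::mutex> lock(mutex_);
    latest_ = message;
  }
  std::optional<std::string> Take() {  // Called from the simulation thread.
    std::lock_guard<std::mutex> lock(mutex_);
    std::optional<std::string> out = latest_;
    latest_.reset();
    return out;
  }

 private:
  std::mutex mutex_;
  std::optional<std::string> latest_;
};

// The subscriber's state as the rest of the diagram sees it.  It changes only
// inside a scheduled update, never mid-evaluation, so the change is an
// ordinary state change and triggers ordinary invalidation.
struct SubscriberState {
  std::string last_message;
};

// Steps 2-3: at each tick of some fixed rate, copy the hidden (mailbox) value
// into the overt state; outputs are then computed from the overt state alone.
void HandlePeriodicUpdate(MessageMailbox* mailbox, SubscriberState* state) {
  if (std::optional<std::string> message = mailbox->Take()) {
    state->last_message = *message;  // State change => invalidation fires here.
  }
}

int main() {
  MessageMailbox mailbox;
  SubscriberState state;
  mailbox.Post("joint_state t=1.23");      // Arrives out-of-thread in reality.
  HandlePeriodicUpdate(&mailbox, &state);  // Next fixed-rate update tick.
  std::cout << "overt state: " << state.last_message << "\n";
  return 0;
}
```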

I've added it to the work plan in the OP.

sherm1 commented 7 years ago

I'm sad to be revisiting this at this late date, since we already made this decision months ago. To dredge up some earlier discussions regarding push vs pull:

I would be delighted to implement this scheme myself and subject it to our usual vigorous review!

sherm1 commented 7 years ago

FTR, here are the arguments David and I wrote when we decided on pull a few months back. I don't think anything has changed since then.

SeanCurtis-TRI commented 7 years ago

@sherm1 FWIW, I agree with you. The question of making sure messages get published in a meaningful sequence is independent of whether dependency evaluation is pushed or pulled. It is about making sure messages get interleaved with time advancement in a sane and reasonable manner.

@david-german-tri I find your example a bit disconcerting. It seems you're trying to capture a sense of "weak" dependency. I.e., the current output value may or may not depend on a particular input. This uncertainty arises from the fact that the evaluation of one value has multiple code paths, at least one of which will not use a particular input and, conversely, one code path will. Your dirty integer is attempting to help capture this fine level of granularity. I've never seen this type of weak dependency accounted for. Generally, the systems I see define dependency at a much more coarse level: if a function is expressed in terms of an input, it depends on it, even if current state follows a code path that doesn't touch it. It seems to me that adding this "weak" dependency adds complexity to the code, making it harder to understand or to assert its correctness. Do we have multiple concrete examples that suggest that this approach is necessary or is this being implemented based on a principle?

sherm1 commented 7 years ago

@SeanCurtis-TRI: the alternative to using an "invalidation tag" is to ensure that any computation that is requested always validates its registered prerequisites, even if it doesn't need them in a particular circumstance. Then invalidation can stop as soon as an already-invalid downstream entry is seen. Either of those approaches is fine with me, because we can automatically ensure prerequisite validation since we know the dependencies. Avoiding a prerequisite validation is a micro-optimization since, as you said, typically all the prerequisites are needed. But either way works.

SeanCurtis-TRI commented 7 years ago

I had a chat with @sherm1 and he clarified some things (I'd gotten a few things twisted around in my head). So I'm going to summarize our conversation.

The problem (at its core) is that we have a pull system with cached values and we need to know when we can use the cached value and when we need to recompute it. If a node is "dirty", it needs to be recomputed (and set itself to be "clean"). If it is "clean", its cached value can be used as is. Our nodes are connected in a dependency graph such that if something changes in the system, "dirtying" a node, all downstream elements from that node should likewise be dirty. There is a coupling between how we propagate dirtiness and how we test and clear it. This coupling is most interesting when the evaluation of a node's output can vary based on state: i.e., in one case, it depends only on input A, in another, only on input B. Cases where the output always depends on all inputs are uninteresting for this conversation.

We have several options to solve this:

1) "Dirtiness" is captured by a single bit. When a node is "dirtied", it mindlessly pushes the dirty signal all the way down stream to all dependent nodes. On the evaluation side, it is enough to request values only from those upstream dependencies that are required in this particular evaluation.

2) "Dirtiness" is captured by a single bit paired with a "timestamp". (This is a summary of what David explained more elegantly above.) In short, each time the system changes and dirtiness happens, a "timestamp" is associated with that event. The "timestamp" is a monotonically increasing value w.r.t. duration of the system. Dirtiness is pushed downstream with the corresponding time stamp. For each dependent node:

3) Dirtiness is a single bit. Propagation starts at the introduction of the dirtiness and simply stops when it encounters a node that is already marked dirty. However, on the evaluation side, each node, prior to performing the output computation, must "touch" each of its inputs (forcing them to compute themselves if dirty), even if any particular input won't end up being used in the computation of the output.

Conclusion:

1) We favor an initial implementation of option 1. It is easy to implement and easy to confirm correctness. It guarantees that we won't compute any inputs unless we actually use them (a very desirable property). The single greatest complaint against it is that it has O(N^2) dirty propagation. However, the constant on this N^2 is so small, and the diagrams we're currently working with are so small, that this particular cost will be lost in the noise.

2) Once things are in place with option 1, we implement option 2 above, adding the timestamp: an optimization of the relatively cheap O(N^2) section of code.
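
For concreteness, here is a minimal sketch of option 1 (plain C++ with invented names; not the eventual Drake implementation): MarkDirty mindlessly pushes downstream with no early termination, and Eval recomputes only the inputs a node's calculation actually touches, which is exactly the "weak dependency" case discussed above.

```cpp
// Minimal sketch of option 1: a single dirty bit, mindless downstream
// propagation, and pull-side recomputation.  Names are illustrative only.
#include <functional>
#include <iostream>
#include <vector>

struct Node {
  bool dirty{true};
  double cached_value{0.0};
  std::vector<Node*> dependents;     // Downstream nodes.
  std::function<double()> calc;      // How to (re)compute our value.

  // Option 1 propagation: no early termination, so a single change can visit
  // a node many times -- the O(N^2) cost mentioned in the conclusion.
  void MarkDirty() {
    dirty = true;
    for (Node* dependent : dependents) dependent->MarkDirty();
  }

  // Pull-side evaluation: recompute only if dirty, touching only the inputs
  // that this node's calc() actually reads.
  double Eval() {
    if (dirty) {
      cached_value = calc();
      dirty = false;
    }
    return cached_value;
  }
};

int main() {
  Node a, b, c;
  a.dependents = {&b};
  b.dependents = {&c};
  a.calc = [] { return 1.0; };
  b.calc = [&a] { return 2.0 * a.Eval(); };
  // c only sometimes reads b -- the "weak dependency" case discussed above.
  bool use_b = false;
  c.calc = [&b, &use_b] { return use_b ? b.Eval() : -1.0; };

  std::cout << c.Eval() << "\n";  // -1; b is never computed since it's unused.
  a.MarkDirty();                  // A change anywhere re-dirties all of c's chain.
  use_b = true;
  std::cout << c.Eval() << "\n";  // 2; now b is pulled and computed.
  return 0;
}
```

Option 2 would replace the unconditional recursion in MarkDirty with the timestamp check, and option 3 would instead have Eval touch every registered input before computing.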

sherm1 commented 7 years ago

Per discussion with David and Russ I am going to tackle this implementation. Please see #4364 for plan and status updates.

sherm1 commented 6 years ago

Closing as done now. See #9205 for remaining issues.