RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Cache intermediate outputs within a Diagram #4127

Closed david-german-tri closed 6 years ago

david-german-tri commented 7 years ago

Problem Statement

Since #3455, Systems within a Diagram "pull" on their input ports instead of reading a precomputed value. As a result, upstream outputs are frequently recomputed with the exact same Context. This is the most pressing System2 framework performance issue right now.

Background

We would like to compute an upstream output only if that computation will produce a different value than the one we already have. A given System's Context can keep track of whether any time, input, state, or parameter values have actually changed. However, even if none of those things have changed, we cannot use a previously computed output with confidence, because it is possible that a state change in an upstream system would cause that system to produce a different output, eventually affecting the input of the System in question.

Detailed Design

To resolve this, whenever any change occurs in Context A that might affect an output of System A, each Context that depends on Context A must be notified that its input is no longer valid. That Context, in turn, must determine whether its own outputs are invalid as a result, and if so, continue recursively propagating the invalidation signal.

As described, this algorithm still scales poorly: if every state in the Diagram is invalidated simultaneously, as typically happens in a simulation timestep, it will produce O(N^2) notifications in the number of Systems, when only O(N) notifications are really necessary. To resolve this, @sherm1 has suggested giving every invalidation event a unique identifier, which is recorded alongside the invalidated quantity. Whenever the recursive invalidation sweep hits a quantity that bears the current identifier, it will stop. Thus, no quantity is ever invalidated twice in the same event, and complexity is bounded above by O(N).

The existing Cache infrastructure, once extended with invalidation IDs, has the right structure for this problem. Each leaf System will have a cache entry for each Context field, and for each output port. By default, the output ports will depend on the entire Context. Accessing a mutable pointer to a leaf Context field will trigger invalidation of all its dependents, generating a new unique ID for the event. Similarly, accessing a mutable pointer to Diagram state will trigger invalidation on the corresponding state component in every subsystem, passing in a shared unique ID to each.

If, as the result of a leaf invalidation event, any of the cache entries corresponding to that System's output ports are newly invalidated, propagation occurs. Propagation will be implemented by asking the parent (Diagram) Context to invalidate inputs on the downstream Contexts.
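
To make the ID-guarded sweep concrete, here is a minimal standalone sketch (names such as `CacheEntry` and `Invalidate` are invented for illustration; this is not the actual Cache/Context API): each entry records the serial number of the last event that invalidated it, so the sweep stops at any entry already stamped with the current serial, and a diamond-shaped dependency is visited only once per event.

```cpp
// Standalone illustration of event-ID-guarded invalidation.  The names here
// are made up for this sketch; they are not Drake's Cache/Context API.
#include <cstdint>
#include <iostream>
#include <vector>

struct CacheEntry {
  bool is_valid{false};
  int64_t last_invalidation_event{-1};  // Serial of the last event that hit us.
  std::vector<CacheEntry*> dependents;  // Entries computed from this one.
};

// Recursively invalidate `entry` and everything downstream of it, stopping at
// any entry already stamped with this event's serial number.  Each entry is
// visited at most once per event, so total work is O(N).
void Invalidate(CacheEntry* entry, int64_t event_serial) {
  if (entry->last_invalidation_event == event_serial) return;  // Already swept.
  entry->is_valid = false;
  entry->last_invalidation_event = event_serial;
  for (CacheEntry* dependent : entry->dependents) {
    Invalidate(dependent, event_serial);
  }
}

int main() {
  // A tiny diamond: state -> {output_a, output_b} -> downstream_input.
  CacheEntry state, output_a, output_b, downstream_input;
  state.dependents = {&output_a, &output_b};
  output_a.dependents = {&downstream_input};
  output_b.dependents = {&downstream_input};
  for (CacheEntry* entry : {&state, &output_a, &output_b, &downstream_input}) {
    entry->is_valid = true;
  }

  // One mutable access to the state generates one new event serial; the shared
  // serial prevents downstream_input from being swept twice in this event.
  int64_t next_event_serial = 0;
  Invalidate(&state, next_event_serial++);
  std::cout << "downstream_input valid? " << downstream_input.is_valid << "\n";
  return 0;
}
```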

Work Plan

The design leads to the following list of tasks. I already have local prototypes of some of these tasks; there are about 60 more hours of work to do to make it all production-ready.

Alternative Considered: Revert #3455

In contrast with the above, the pre-#3455 push invariant allowed Systems and Contexts to assume their inputs were already up to date. This invariant was achieved by traversing the Diagram to evaluate outputs in sorted order. While this approach created a theoretical risk of computing some outputs unnecessarily due to branching in downstream Systems, it also offered a trivial O(N) upper bound on any computation, and a straightforward way to skip computations based on sparsity annotation. Moreover, the simple locality guarantee would have enabled caching outputs by inspection of the leaf Context, with no propagation bookkeeping at all, reducing the linear coefficient to near-zero when all outputs are fresh.
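
For comparison, here is a tiny sketch of that push-style sweep (illustrative names only, not the pre-#3455 System 2 API): systems are evaluated once per step in topologically sorted order, so every input is already up to date when it is read.

```cpp
// Tiny sketch of a "push" evaluation order: outputs are computed once per
// step in sorted (upstream-first) order.  Names are invented for this sketch.
#include <functional>
#include <iostream>
#include <vector>

struct PushSystem {
  std::function<double(double)> calc_output;  // Maps this system's input to its output.
  double input{0.0};
  double output{0.0};
  std::vector<PushSystem*> downstream;        // Systems fed by our output port.
};

// One push sweep: every output is computed exactly once, an O(N) bound per
// step, and each downstream input is written before it is ever read.
void EvaluateInSortedOrder(const std::vector<PushSystem*>& sorted_systems) {
  for (PushSystem* system : sorted_systems) {
    system->output = system->calc_output(system->input);
    for (PushSystem* sink : system->downstream) sink->input = system->output;
  }
}

int main() {
  PushSystem source, gain;
  source.calc_output = [](double) { return 3.0; };
  gain.calc_output = [](double u) { return 2.0 * u; };
  source.downstream = {&gain};
  EvaluateInSortedOrder({&source, &gain});
  std::cout << gain.output << "\n";  // 6
  return 0;
}
```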

#3455 could be reverted with no impact on System authors. It's easy to see this because #3455 itself essentially did not change any of the systems that existed at the time.

The one argument against this approach is the same argument that prevailed in the run-up to #3455: it might compute outputs wastefully, if a downstream System notionally depends on an output that it doesn't actually read. The pull architecture offers a stronger guarantee that an output is only computed if it is actually used. Since that concern remains as valid (or not) as it was back then, I don't think a revert is in the cards.

siyuanfeng-tri commented 7 years ago

Hey David, what about LCM subscribers? I see two problems with them: they don't currently have a notion of when a new message arrives, and they are threaded. The first one is probably not that hard to handle. I am not sure what's the best way to deal with the threading part. What happens when you get a new message halfway through evaluation? It's probably not the best idea to throw everything away and start over again.

I actually like push better than pull just because it's easier, and it's sort of a user fault to have unconnected output ports that require a large amount of computation.

sherm1 commented 7 years ago

@siyuanfeng-tri, we have to think about message arrival in the larger context of a time-advancing simulation, where they appear as randomly-timed events. The timing has to be discretized properly so that the message's influence occurs between, rather than during, a continuous integration interval. That can be done by sampling or via the simulator's event handling mechanism, but should never be allowed to modify an evaluation-in-progress.

SeanCurtis-TRI commented 7 years ago

The problem statement doesn't make it clear why "ids" are specifically necessary. Why not just a dirty bit? This id mechanism suggests that a system whose state is invalidated by one event, and never updates its values, could be invalidated again by another event, even though it was already marked as "dirty".

siyuanfeng-tri commented 7 years ago

@sherm1 I agree with you that there should be some atomic access per simulation / control tick, but I don't think that's the case at the moment. Independent of simulation, the controller is triggered essentially whenever there is a new incoming state (msg), and I am not sure how that interacts with the event handling (I am not running a simulator, but I suppose I could).

david-german-tri commented 7 years ago

> The problem statement doesn't make it clear why "ids" are specifically necessary. Why not just a dirty bit? This id mechanism suggests that a system whose state is invalidated by one event, and never updates its values, could be invalidated again by another event, even though it was already marked as "dirty".

@SeanCurtis-TRI Yes, exactly. Abstracting away inputs and outputs for the moment, suppose cache value C depends on cache value B, which depends on cache value A, and all three cache values are valid. Then the following sequence of events occurs:

  1. The value of A changes (invalidation event 1), so B becomes invalid. Because B becomes invalid, C becomes invalid.
  2. The value of C is recomputed, but due to an if statement on some other variable, B doesn't have to be recomputed right this instant. Thus, B is still invalid.
  3. The value of A changes again (invalidation event 2), so B becomes invalid. If we only had a dirty bit, we would stop here, but that would incorrectly leave C valid! The ID allows us to detect that B was last invalidated on event 1, and this is event 2, and therefore B becomes invalid (again), and therefore C becomes invalid.
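
Here is the same sequence as a tiny standalone sketch (plain C++ with invented names, not Drake code); the comment in `Invalidate` shows exactly where a single dirty bit would go wrong:

```cpp
// Illustration of the A -> B -> C sequence above; the names are invented for
// this sketch and are not Drake's API.
#include <cstdint>
#include <iostream>
#include <vector>

struct Entry {
  bool valid{true};
  int64_t last_event{-1};
  std::vector<Entry*> dependents;
};

void Invalidate(Entry* entry, int64_t event) {
  // With only a dirty bit the guard would be `if (!entry->valid) return;`,
  // and step 3 below would stop at B, incorrectly leaving C marked valid.
  if (entry->last_event == event) return;
  entry->valid = false;
  entry->last_event = event;
  for (Entry* dependent : entry->dependents) Invalidate(dependent, event);
}

int main() {
  Entry A, B, C;
  A.dependents = {&B};
  B.dependents = {&C};

  Invalidate(&A, 1);  // Step 1: A changes (event 1); B and C become invalid.
  C.valid = true;     // Step 2: C is recomputed without needing B.
  Invalidate(&A, 2);  // Step 3: A changes again (event 2); sweeps through B.
  std::cout << "C valid? " << std::boolalpha << C.valid << "\n";  // false
  return 0;
}
```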
david-german-tri commented 7 years ago

@siyuanfeng-tri Wow, thanks for calling that out. I hadn't considered out-of-thread invalidation events, and LCM subscribers are a perfect motivating example. I think you and @sherm1 are right that those invalidations need to be synced to the timestep; event handling seems like a reasonable way to do it. Basically:

  1. LCM message arrives and gets written to hidden state out-of-thread.
  2. The next time CalcNextUpdateTime is called, the system requests an update at the next tick of some fixed rate.
  3. In the update, the hidden state gets copied to overt state, leading to invalidation.
  4. The overt state determines the output.
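
For illustration, here is a rough sketch of steps 1-3 (plain C++; names like `MessageMailbox` and `HandlePeriodicUpdate` are hypothetical, and this is not the actual LCM subscriber or System API): the receive thread only ever writes into a mutex-guarded mailbox, and the copy into overt state happens inside the scheduled update, so to the rest of the diagram it looks like an ordinary state change.

```cpp
// Sketch of the mailbox-then-periodic-update idea; all names are hypothetical
// and this is not Drake's LCM or System API.
#include <iostream>
#include <mutex>
#include <optional>
#include <string>

// Step 1: the LCM receive thread posts messages here at arbitrary times.
class MessageMailbox {
 public:
  void Post(const std::string& message) {  // Called out-of-thread.
    std::lock_guard<std::mutex> lock(mutex_);
    latest_ = message;
  }
  std::optional<std::string> Take() {  // Called from the simulation thread.
    std::lock_guard<std::mutex> lock(mutex_);
    std::optional<std::string> out = latest_;
    latest_.reset();
    return out;
  }

 private:
  std::mutex mutex_;
  std::optional<std::string> latest_;
};

// The subscriber's state as the rest of the diagram sees it.  It changes only
// inside a scheduled update, never mid-evaluation, so the change is an
// ordinary state change and triggers ordinary invalidation.
struct SubscriberState {
  std::string last_message;
};

// Steps 2-3: at each tick of some fixed rate, copy the hidden (mailbox) value
// into the overt state; outputs are then computed from the overt state alone.
void HandlePeriodicUpdate(MessageMailbox* mailbox, SubscriberState* state) {
  if (std::optional<std::string> message = mailbox->Take()) {
    state->last_message = *message;  // State change => invalidation fires here.
  }
}

int main() {
  MessageMailbox mailbox;
  SubscriberState state;
  mailbox.Post("joint_state t=1.23");      // Arrives out-of-thread in reality.
  HandlePeriodicUpdate(&mailbox, &state);  // Next fixed-rate update tick.
  std::cout << "overt state: " << state.last_message << "\n";
  return 0;
}
```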

I've added it to the work plan in the OP.

sherm1 commented 7 years ago

I'm sad to be revisiting this at this late date, since we already made this decision months ago. To dredge up some earlier discussions regarding push vs pull:

I would be delighted to implement this scheme myself and subject it to our usual vigorous review!

sherm1 commented 7 years ago

FTR, here are the arguments David and I wrote when we decided on pull a few months back. I don't think anything has changed since then.

SeanCurtis-TRI commented 7 years ago

@sherm1 FWIW, I agree with you. The question of making sure messages get published in a meaningful sequence is independent of whether dependency evaluation is pushed or pulled. It is about making sure messages get interleaved with time advancement in a sane and reasonable manner.

@david-german-tri I find your example a bit disconcerting. It seems you're trying to capture a sense of "weak" dependency. I.e., the current output value may or may not depend on a particular input. This uncertainty arises from the fact that the evaluation of one value has multiple code paths, at least one of which will not use a particular input and, conversely, one code path will. Your dirty integer is attempting to help capture this fine level of granularity. I've never seen this type of weak dependency accounted for. Generally, the systems I see define dependency at a much more coarse level: if a function is expressed in terms of an input, it depends on it, even if current state follows a code path that doesn't touch it. It seems to me that adding this "weak" dependency adds complexity to the code, making it harder to understand or to assert its correctness. Do we have multiple concrete examples that suggest that this approach is necessary or is this being implemented based on a principle?

sherm1 commented 7 years ago

@SeanCurtis-TRI: the alternative to using an "invalidation tag" is to ensure that any computation that is requested always validates its registered prerequisites, even if it doesn't need them in a particular circumstance. Then invalidation can stop as soon as an already-invalid downstream entry is seen. Either of those approaches is fine with me, because we can automatically ensure prerequisite validation since we know the dependencies. Avoiding a prerequisite validation is a micro-optimization since, as you said, typically all the prerequisites are needed. But either way works.

SeanCurtis-TRI commented 7 years ago

I had a chat with @sherm1 and he clarified some things (I'd gotten a few things twisted around in my head). So I'm going to summarize our conversation.

The problem (at its core) is that we have a pull system with cached values and we need to know when we can use the cached value and when we need to recompute it. If a node is "dirty", it needs to be recomputed (and set itself to be "clean"). If it is "clean", its cached value can be used as is. Our nodes are connected in a dependency graph such that if something changes in the system, "dirtying" a node, all downstream elements from that node should likewise be dirty. There is a coupling between how we propagate dirtiness and how we test and clear it. This coupling is most interesting when the evaluation of a node's output can vary based on state: i.e., in one case, it depends only on input A, in another, only on input B. Cases where the output always depends on all inputs are uninteresting for this conversation.

We have several options to solve this:

1) "Dirtiness" is captured by a single bit. When a node is "dirtied", it mindlessly pushes the dirty signal all the way down stream to all dependent nodes. On the evaluation side, it is enough to request values only from those upstream dependencies that are required in this particular evaluation.

2) "Dirtiness" is captured by a single bit paired with a "timestamp". (This is a summary of what David explained more elegantly above.) In short, each time the system changes and dirtiness happens, a "timestamp" is associated with that event. The "timestamp" is a monotonically increasing value w.r.t. duration of the system. Dirtiness is pushed downstream with the corresponding time stamp. For each dependent node:

3) Dirtiness is a single bit. Propagation starts at the introduction of the dirtiness and simply stops when it encounters a node that is already marked dirty. However, on the evaluation side, each node, prior to performing the output computation, must "touch" each of its inputs (forcing them to compute themselves if dirty), even if any particular input won't end up being used in the computation of the output.

Conclusion:

1) We favor an initial implementation of option 1. It is easy to implement and easy to confirm correctness. It guarantees that we won't compute any inputs unless we actually use them (a very desirable property). The single greatest complaint against it is that it has O(N^2) dirty propagation. However, the constant on this N^2 is so small, and the diagrams we're currently working with are so small, that this particular cost will be lost in the noise.

2) Once things are in place with option 1, we implement option 2 above, adding the timestamp: an optimization of the relatively cheap O(N^2) section of code.
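
For concreteness, here is a minimal sketch of option 1 (plain C++ with invented names; not the eventual Drake implementation): MarkDirty mindlessly pushes downstream with no early termination, and Eval recomputes only the inputs a node's calculation actually touches, which is exactly the "weak dependency" case discussed above.

```cpp
// Minimal sketch of option 1: a single dirty bit, mindless downstream
// propagation, and pull-side recomputation.  Names are illustrative only.
#include <functional>
#include <iostream>
#include <vector>

struct Node {
  bool dirty{true};
  double cached_value{0.0};
  std::vector<Node*> dependents;     // Downstream nodes.
  std::function<double()> calc;      // How to (re)compute our value.

  // Option 1 propagation: no early termination, so a single change can visit
  // a node many times -- the O(N^2) cost mentioned in the conclusion.
  void MarkDirty() {
    dirty = true;
    for (Node* dependent : dependents) dependent->MarkDirty();
  }

  // Pull-side evaluation: recompute only if dirty, touching only the inputs
  // that this node's calc() actually reads.
  double Eval() {
    if (dirty) {
      cached_value = calc();
      dirty = false;
    }
    return cached_value;
  }
};

int main() {
  Node a, b, c;
  a.dependents = {&b};
  b.dependents = {&c};
  a.calc = [] { return 1.0; };
  b.calc = [&a] { return 2.0 * a.Eval(); };
  // c only sometimes reads b -- the "weak dependency" case discussed above.
  bool use_b = false;
  c.calc = [&b, &use_b] { return use_b ? b.Eval() : -1.0; };

  std::cout << c.Eval() << "\n";  // -1; b is never computed since it's unused.
  a.MarkDirty();                  // A change anywhere re-dirties all of c's chain.
  use_b = true;
  std::cout << c.Eval() << "\n";  // 2; now b is pulled and computed.
  return 0;
}
```

Option 2 would replace the unconditional recursion in MarkDirty with the timestamp check, and option 3 would instead have Eval touch every registered input before computing.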

sherm1 commented 7 years ago

Per discussion with David and Russ I am going to tackle this implementation. Please see #4364 for plan and status updates.

sherm1 commented 6 years ago

Closing as done now. See #9205 for remaining issues.