Consider optimizations for heavily-nested scenarios

lukewagner commented 2 weeks ago

Once this proposal gets to Phase 3 and we can study the performance of realistic workloads, I think it's important for this group to measure a workload that, e.g., uses stack-switching for both generators with green-threading where there are stacks of the form:

| green-thread-handler | generator-handler 0 | ... | generator-handler N | suspend to green-thread-handler |

Because generator nesting depth is controlled by the user program, this N can be arbitrarily-large (e.g., generators can be called recursively) and a naive implementation will require an O(N) search when switching between green threads.

Assuming that the cost of this O(N) search isn't somehow amortized away, it seems like we should work out a predictable optimization that engines can perform so that, in practice, suspension to the green-thread-handler is O(1) (like a native engine would do). Although optimizing the O(N) case in general doesn't seem possible, identifying a subset of cases that cover the above scenario does seem imminently possible. However, I think it's important for us to have a proper shared discussion about what this predictable optimization is before declaring this feature finished so that (1) we can publish it in some form and continue to claim that wasm has predictable performance, (2) if there is some small tweak or addition to the proposal that would make the optimization significantly simpler or more effective, we still have the opportunity to make that tweak.

lukewagner commented 2 weeks ago

Just to seed the discussion with a concrete idea, here's a simple optimization that might be sufficient to cover the generator+green-thread case:

In the execution context, maintain a possibly-empty, 1-element "switch-cache" with two fields -- a tag $e and a pointer to the innermost handler with (on $e ...) where the ... much be switch -- by doing the following:

When entering wasm from the host, clear the switch-cache.
When performing a resume:
- If there is a switch handler, set the cache before resuming the continuation and clear the cache if the continuation returns. (EDIT: wrong, see next comment)
- If there is a non-switch handler whose tag matches the tag in the switch-cache: clear the switch-cache.
When performing a suspend or throw: clear the switch-cache if the stack search hits the handler in the switch-cache.
When performing a switch:
- If the switch tag matches the tag in the switch-cache: use the handler in the switch-cache.
- Otherwise, perform a stack search as usual and fill the switch-cache with the results of the search.

Obviously there are more complex scenarios (with multiple nested switch handlers with different tags or resume used to indirectly switch) that wouldn't be covered by this optimization and extensions to the optimization to handle such cases (with more overhead to maintain) so there's an interesting engineering tradeoff to be made. The difference between this optimization kicking in or not can be the difference between aggregate O(N) and O(N²) behavior, so I think it's important that we clearly spell out what optimizations a toolchain can rely on since, if I'm a toolchain author, this may significantly impact the compilation strategy I choose. If I can't rely on any such optimization, I may not even want to use directly-nested handlers at all (even when it would otherwise be the natural way to do it) if I also want to allow green threading and don't want O(N²) worst-case behavior.

tlively commented 2 weeks ago

If there is a switch handler, set the cache before resuming the continuation and clear the cache if the continuation returns.

This would only be correct if the resumed continuation did not include a deeper switch handler for the same tag. We have the same problem when switching to a continuation. To maintain the invariant that the cache always points to the innermost switch handler, we could record the innermost switch handler on each continuation and use it to populate the cache on switch and resume. Alternatively, we could simply record whether each continuation contains a switch handler and leave the cache cleared when switching to a continuation that contains one.

We should definitely have spec tests that exercise these edge cases.

If there is a non-switch handler whose tag matches the tag in the switch-cache: clear the switch-cache.

This part shouldn't be necessary since switch handlers and suspend handlers with the same tag do not interfere with each other. (See #78)

lukewagner commented 2 weeks ago

This would only be correct if the resumed continuation did not include a deeper switch handler for the same tag.

Oops, right; I was only thinking of the resuming a fresh continuation; thanks! (All the more reason to have these optimizations well-described and well-tested...) But yeah, saving this cache on the continuation itself makes sense as a fix.

This part shouldn't be necessary since switch handlers and suspend handlers with the same tag do not interfere with each other. (See #78)

Oh fantastic; I was hoping for exactly that and skimming the explainer for whether or not that was the case and thinking it might be a tweak to suggest to avoid the overhead of guarding for it all the time in this optimization.

WebAssembly / stack-switching

Consider optimizations for heavily-nested scenarios #75