bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0
15.12k stars 1.26k forks source link

Cranelift: implement path-sensitive constant propagation #9049

Open cfallin opened 1 month ago

cfallin commented 1 month ago

In a discussion of egraph-framework extensions today, the idea of path-sensitive constant propagation, or propagating a constant-value fact through the dom-subtree below a branch that tests that value, came up. I wanted to record the idea, some issues it will run into, and some subsequent ideas I had for how to solve them.

In brief, the idea is that if we have a value x, it may not have a constant value overall, but past a branch v0 := icmp_imm eq x, 0; brif v0, ... we know that x == 0 on the taken branch (precisely, the target block and any blocks it dominates). We can thus do a form of cprop that is specific to that region of code. (There are some Spectre-safety implications here that we need to consider, but I'll gloss over for now.)

This is different than "classical" cprop because we don't have a constant fact about x; thus, we can't simply union it with iconst 0 in an eclass in the aegraph framework. Rather we need some notion of "true in this region".

The formulation of constant propagation in the aegraph framework provides both a nice property to help with this, and a significant problem:

To see why the latter, consider this example:

block1:
  v1 = ...
  v2 =  ...
  v3 = ishl v1, v2
  v4 = icmp_imm eq v2, 0
  brif v4, block2, block3

block2:
  return v3 ;; *should* rewrite to v1: we know only in this branch that `v2 == 0`, so `v1 << 0 == v1`

block3:
  ... ; left-shift-by-nonzero case; no further rewrites here

We process v3, and do all rewrites we can; this doesn't include the key simplification for left-shift-by-zero, because the shift amount isn't constant-zero at this program point. We have a final eclass for v3. Then, at some later time, we process the branch, we enter the subtree at block2, and here we have an additional assumption v2 == 0. What do we do?

We can union v2 with iconst 0; but then there may be an arbitrary number of other expression nodes built on top of v2 that we also should rebuild, propagating the implications of that assumption through. Furthermore we might even have later branches with other values (say v2 == 1); generally, we may have an arbitrary number of "sub-contexts" in which we do rewrites on the same expression nodes over again. And we have to process these later -- visiting and re-visiting the nodes over and over again -- because the branch assumptions that cause further implications may come only deep in the domtree, after we've already eagerly processed all of these expression nodes at their original appearance.

To say it another way: dominance, use-before-def processing order, and acyclicity do not save us here, because the branches are inputs to the rewrite reasoning but come much later than the things they affect. (Someone mentioned during this discussion that it had to do with pure enodes not having a notion of dominance or location, but this would be true even if they did.)


What to do? I don't think adding back parent pointers to the aegraph is the right solution here: that implies mutating the original knowledge base, then we have to unwind it when we leave the subregion.

I also briefly considered making all nodes in a given subtree dependent on the closest dominating branch (or just an identifier for the domtree node), but I suspect this would cause significantly more harm in compile-time blowup: it would mean we can no longer amortize the storage and work of optimizations on pure nodes in the common case when branch conditions don't affect them.

One solution may be to add a notion of context to the original value -> eclass mapping (so we have a mapping from "original value in this context" tuples to eclass handles). Then probably we shouldn't eagerly compute rewrites in all possible contexts; rather we should lazily query and memoize.

In the above example imagine we have some context c0 at the root, and c1 at the dom-subtree at block2. Then when we use v2 at the return we can query the hashmap for (c1, v2); if not present, re-process in that context. We can short-circuit most cases: a value is different in a context and requires a reprocessing if the context updates its mapping directly (e.g. c2 updates v1 to the eclass union'd with 0) or if any of its args require reprocessing. Otherwise, pass through from parent context. Perhaps there's also a way to filter processing down to only values that feed into any control dependency and their downstream dataflow (compute with prepass).

That's a pretty handwavy pointer to a possible solution but I at least wanted to record the major issue/question to ensure there is no confusion! (cc @jameysharp, @fitzgen, @elliottt, @avanhatt, @mwillsey)

avanhatt commented 1 month ago

Thanks for writing this out!

The concrete example with the ishl before the branch makes the challenge here more obvious (though I think it has a typo: it should be return v3; before the rewrite, right?).

I like the general idea of including the context in the eclass mapping. I don't fully follow the "making all nodes in a given subtree dependent on the closest dominating branch" alternative (dependent in what sense? How would that revisit v2 in this case?). That's probably fine though if we think the (context, value) key idea is a better one anyway.

cfallin commented 1 month ago

Typo fixed, thanks!

(The example I was trying to describe yesterday did have the ishl up at the top as well; maybe that wasn't clear and why folks were so confused...)

dependent in what sense? How would that revisit v2 in this case?

It's a little handwavy but the idea would be to make the branch an actual input to the operator -- or, when the def is above the branch, have another "make control dependent" pseudo-op that takes the original value and the control input, I guess. Now that I game that out further I like it even less so I think the context-dependent lookup table is definitely the right way to go.

jameysharp commented 1 month ago

Thanks for writing this up! I was so confused yesterday but I think I get it now.

It might eventually be useful to also allow path-dependent optimizations based on instructions which are not block terminators, although we'd need other changes before that would matter. I think that probably still works with the "context" approach.

Also, there's something I think is interesting about what patterns we can collect information from around branch instructions. This:

  v4 = icmp_imm eq v2, 0
  brif v4, block2, block3

could instead be written like this:

  brif v2, block3, block2

Certainly the same path-dependent optimizations should apply to this version, and similarly to br_table for everything but the default case; this form seems easier to reason about. Ideally we'll do that rewrite in the mid-end someday. Our mid-end notion of "truthiness-preserving" operations, including the related instruction icmp_imm ne x, 0, should let us rewrite those to this bare brif form as well.

But I'm trying to figure out how to generalize the reasoning that applies to the version you wrote for non-zero constants. There's some kind of backward range analysis here, I think? Within the "false" block we know the branch condition was 0 so we can propagate that backward to the result of the icmp; if that instruction was icmp_imm ne x, 23 then knowing it produced 0 allows us to conclude that x is 23. That in turn might propagate further backward and give us more constraints that we can conditionally assume. To do the same for the "true" branch in general we need an abstract domain supporting at least "this value is non-zero", such as a value range analysis, although we could go for something more specialized.

I think it's interesting that this is exactly the same analysis and reasoning we want to apply to instructions within the scope of the conditional branch, only propagated backward from results to arguments.

mwillsey commented 1 month ago

@altanh were just discussing this today, and it seems very connected to the fact that, despite ISLE being capable of top-down rewriting (if you allow recursive rules), Cranelift exclusively uses it in a bottom up way. This fundamentally means that doing something like path sensitivity will require that you either re-compute something, or pre-compute which context you'll be rewriting something under. In the setting of the above example:

v4 = icmp_imm eq v2, 0
brif v4, block2, block3

If you've already rewritten block2, you're pretty much out of luck; you'll have to redo because now you can do it with "new knowledge". If you instead did things top down:

(rule (simplify (brif v4 block2 block3))
         (let
            ((b2 (with v4 (simplify block2)))
             (b3 (simplify block3)))
            (brif v4 b2 b3)))

Here, with is an external constructor whose effects are scoped. This would likely involve an extension to ISLE (if you wanted to do it at the ISLE level, not necessary though). The semantics are to somehow "process" the boolean v4 into new facts in the e-graph (only new unions at the moment, but could extend to range analysis for inequalities), and then evaluate (simplify block2) in the context of those new assumptions. The scoping aspect is necessary; you'd need to "roll back" the hashcons map to avoid leaking the fact that e.g. x = 0 or something. I agree with Chris that the functional nature of the aegraph data structures will help here.

Even if you do all this outside of ISLE, you still need to ensure the same thing: that block2 and block3 are simplified (1) after v4, that way you actually get to learn something useful, and (2) they are simplified in separate "contexts" from one another, i.e., that the assumption of v4 doesn't leak into the simplification of block3.

re: how to solve this with a data structure, I think the way ISLE operates presents an opportunity to solve this problem in time rather than space. So simplification is only every executed in one context at a time (so you don't have to keep track of as much), but you need the ability to "retract" any assumptions when you leave a context so you don't leak anything.

fitzgen commented 1 month ago

@mwillsey When doing rewrites and populating e-classes, we already process blocks in an order that ensures that we visit dominating blocks before dominated blocks, defs before uses, and all that. We do rewrites as a forwards pass, in a pre-order traversal of the dom tree.

The twist is that our e-classes all represent pure instructions that are "floating" and aren't placed in the CFG skeleton until the "elaboration" pass that follows after the rewriting pass and after we've extracted the best e-node for each e-class.[^1] Elaboration tries to do code motion like floating value definitions up above loops (LICM) and down towards uses (to avoid partially-redundant computations). I say "tries to do" but it is more like "is implemented in such a way that these things naturally occur" because there is nowhere to push them up or down from because they are not in the CFG or located within any basic block until after elaboration.

[^1]: FWIW, this elaboration pass is also a domtree preorder traversal, IIUC, but we delay placing values until they are demanded by used by a side-effecting instruction. And we effectively make loops demand values, so that LICM happens automatically, when possible.

This separation between passes is nice, because we get all those code motion optimizations "for free" compared to if we were eagerly placing the best rewrite of a value at the same location as the original value while we did the rewrite pass. But it also means that if we aren't careful and keep track of the relationship between scopes and e-classes somehow across the two separate passes rather than just via pushing and popping scopes while we do rewrites, we might accidentally use the path-sensitive version of a value outside of its required scope.

If we did egraph rewriting and code placement in a single pass, where we eagerly placed the best rewrite of an instruction at the same location as the original instruction, then what you've sketched out would work, I believe. We would need to add a separate, dedicated code motion pass, but maybe that wouldn't be too bad; the biggest concern would be compile time overhead of new passes vs folding that work into the existing egraph pass(es). But maybe that isn't a big deal because we are removing a pass from the egraphs phase and adding a code motion pass afterwards, so it all evens out? No idea.

Does that make sense? Forgive me if you already understand all that and I misunderstood assumptions you were making in your comment.

cfallin commented 1 month ago

I think Max's suggestion is equivalent (as far as I can see) to switching from eager rewriting to a form of lazy rewriting, again. To restate what I think is the kernel of the problem with the example above: we see the expression for v3 (the shift that we conditionally rewrite differently in the different contexts) before we see the branch and even realize the separate context exists. There may be many such contexts deeper in the domtree.

I'm interpreting the functional-style "(with v4 (simplify block2))" as implying that within the dynamic scope of a context v2==0, we do rewrites on block2. But again block2 doesn't contain the operator that we want to rewrite differently; that operator was already seen above. So in an eager-eval world, we've already done all rewrites.

(Important side-note here: the operator doesn't have a location once it's in the egraph, but it does have an original location in the input code, and that's relevant for this processing order.)

If we defer applying any rewrites to a value until we use it, we could make use of all the context we have at the use-site, but many other aspects of the aegraph formulation have to change I think. At the very least, the GVN that naturally happens by deduping -- including deduping all of the nodes that are created as part of the rewrites -- breaks, because we will now only insert values into the scoped hashmap deep in the traversal.

We are still not fully back to the world of conventional egraphs with batched rewrites, because on first use (in a given context) we get the final eclass handle for that context; so it's still eager in some important sense; but "scoped-ly eager". We'd need to carefully consider how this interacts with GVN, elab, and our domtree-based auto-subsume.