Tilt: a proposal for fast control flow

WebAssembly / design

WebAssembly Design Documents

http://webassembly.org

Apache License 2.0

11.4k stars 694 forks source link

Tilt: a proposal for fast control flow #33

Closed kripken closed 9 years ago

kripken commented 9 years ago

The goals of this proposal are

Keep representing control flow in a structured manner, as we already agree; the structure makes it compressible, easy to optimize on the client, and is nice for polyfilling too.
Reduce currently-existing control flow overhead, which we know exists in practice in solutions like Emscripten, and may be impossible to avoid (even if it is typically small).
Do so in a way that allows both (1) a super-simple VM implementation, which might not reduce that overhead, but shows this does not force undue complication on browsers, and (2) a clear optimization path to eliminating that overhead entirely.

This proposal introduces two new AST constructs, best explained with an example:

while (1) {
  if (check()) {
    tilt -> L1;
  } else if (other()) {
    tilt -> L2;
  } else {
    work();
  }
}
multiple {
  L1: { l_1(); }
  L2: { l_2(); }
}
next();

What happens here is if check(), then we execute l_1(), and if other() then we execute l_2(); otherwise we do some work() and go to the top of the loop. Conceptually, Tilt doesn't actually affect control flow - it just "goes with the flow" of where things are moving anyhow, but it can "tilt" things a little one way or another (like tilting a pinball machine) at specific locations. Specifically, if control flow reaches a multiple, then the tilt has an effect in picking out where we end up inside the multiple. But otherwise, Tilt doesn't cause control flow by itself.

In more detail, the semantics are fully defined as follows:

Define label as a "hidden" local variable on the stack, only visible to the next 2 bullet points.
tilt -> X sets label to a numerical id that represents the label X.
multiple { X1: { .. } X2: { .. } .. } checks the value of label. If it is equal to the numerical id of one of the labels X1, X2, ... then we execute the code in that label's block, set label to a null value (that is not the id of any label), and exit the multiple. Otherwise (if it is not equal to any of the ids), we just skip the multiple.

Note that Tilt does not move control flow to anywhere it couldn't actually get anyhow. Control flow still moves around using the same structured loops and ifs and breaks that we already have. Tilt only does a minor adjustment to control flow when we reach a multiple. The multiple can be seen as a "switch on labels", and the tilt picks which one we enter. But again, we could have reached any of those labels in the multiple anyhow through structured control flow (and picked which label in the multiple using a switch on the numerical id of the label).

The semantics described above also provide the "super-simple" implementation mentioned in the goals. It is trivial for a VM to implement that - just define the helper variable, set it and test it - and it would be correct; and control flow is still structured. But, it is also possible in a straightforward way to just emit a branch from the tilt to the right place. In the example above, that is perhaps too easy, so consider this more interesting example:

L1: while (1) {
  if (checkA()) {
    tilt -> L2;
  } else {
    if (checkB()) break;
    while (1) {
      work();
      if (checkC()) {
        tilt -> L3;
        break L1;
      }
    }
    never();
  }
}
multiple {
  L2: { print('checkA passed!') }
  L3: { print('checkC passed!') }
}

It is straightforward for the VM to see that tilt -> L2 will reach L2, and tilt -> L3 will reach L3 - note how we need a break after the tilt to achieve that - so it can just emit branches directly. The helper variable overhead can be eliminated entirely.

This idea is modeled on the Relooper algorithm from 2011. There is a proof there that any control flow can be represented in a structured way, using just the available control flow constructs in JavaScript, and using a helper variable like label mentioned in the Tilt semantics, without any code duplication (other approaches split nodes, and have bad worst-case code size situations). The relooper has also been implemented in Emscripten, and over the last 4 years we have gotten a lot of practical experience with it, showing that it gives good results in practice, typically with little usage of the helper variable.

Thus, we have both a solid theoretical result that shows we can represent any control flow - even irreducible - in this way, and experience showing that that helper variable overhead tends to be minimal. In the proposal here, that helper variable is, in essence, written in an explicit manner, which allows even that overhead to be optimized out nicely by the VM.

A note on irreducible control flow: As mentioned, Tilt can represent it, e.g.,

tilt -> L2;
while (1) {
  multiple {
    L1: {
      duffs();
      tilt -> L2;
    }
    L2: {
      device();
      if (done()) break;
      tilt -> L1;
    }
  }
}

This is not surprising as the relooper proof guarantees that it can. And we can also make that irredicuble control flow run at full speed (if the VM optimizes Tilt, instead of doing just the super-simple semantics as its implementation). Still, this is not quite as good as if we directly represented that irreducible control flow as basic blocks plus branches directly, since with Tilt we do need to analyze a little to find that underlying control flow graph. So something like proper tail calls, which can directly represent that graph, may still be useful, but it is debatable - at least a large set of cases of proper tail calls seem to be handled by Tilt (the cases of having heavily irreducible control flow), and without the limitations of proper tail calls like number of parameters. However, there is obviously a lot to debate on that.

In any case, putting aside irreducible control flow and proper tail calls, what Tilt is clearly great for is to eliminate any small amounts of helper variable usage, that occur often in practice, stemming from either small amounts of irreducible control flow (either a goto in the source, or a compiler optimization pass that complexified control flow for some reason), or reducible but complex enough control flow that the compiler didn't manage to remove all helper variable usages (it is an open question whether 100% can be removed in tractable time). Having Tilt in wasm would open the possibility for straightforward and predictable removal of that overhead.

To summarize this proposal, it can be seen as adding an "escape valve" to structured control flow, where code remains still structured, but we can express more complex patterns in a way that is at least easily optimizable without magical VM tricks.

That concludes the core of this proposal. I see two possible large open questions here:

The above cannot optimize indirect branches. Some interpreter loops really want that. I'm not sure if we want to get into that now. But it seems that a clear path is available - an "indirect Tilt" would use a variable instead of a label, together with a way to get the numeric id of a label.
In the proposal above, we define the semantics in a super-simple way. I like the simplicity and how it also shows how to implement the feature in a trivial way, so this should not slow down development of wasm VMs. But a downside to that is that it allows one to write nonsense like

tilt -> L1;
work();
L2: while (1) {
  more();
  tilt -> L2;
  if (check()) break;
}
multiple {
  L3: { never(); }
}

None of the tilts do anything; the multiple is ignored; just reading that makes me nauseous. But it is valid code to write, even if silly. It might be nice to get a parse-time error on such things, and it isn't hard. But to do so requires that we define what is errored on very precisely, which is a lot more detailed than the very simple (but fully defined!) semantics described above.

edit: that detailed description appears in https://github.com/WebAssembly/spec/issues/33#issuecomment-99312986

And that brings me to a final point here. We could in principle delay discussion of this to a later date. However, if we do want to add this later, and do want to error on such nonsense as the last example, then how easy and clean it is to define the semantics will depend on decisions that we do make now. In particular, the fewer control flow constructs we have, the easier and cleaner it will be; every construct may need to be involved in the description. Also, we might want to start out with simple constructs that make that easier (I have some ideas for those, but nothing too concrete yet).

In other words, what I am getting at is that if we design our control flow constructs as a whole, together with Tilt, then things will be much nicer than if we try to tack Tilt on later to something designed before it. For that reason, we might want to discuss it now.

Bikeshedding: Suggestions for better names are welcome. Another name I considered was slide -> X, as in "I totally meant to slide there".

pizlonator commented 9 years ago

Under what conditions does tilt allow the implementation do more than what it would have done if it jump-threaded any CFG path A->B->C where B is a switch on a variable, the variable is known to be a constant in A, and C is the destination that B would jump to when passed that constant?

-Filip

On May 5, 2015, at 3:30 PM, Alon Zakai notifications@github.com wrote:

The goals of this proposal are

Keep representing control flow in a structured manner, as we already agree; the structure makes it compressible, easy to optimize on the client, and is nice for polyfilling too. Reduce currently-existing control flow overhead, which we know exists in practice in solutions like Emscripten, and may be impossible to avoid (even if it is typically small). Do so in a way that allows both (1) a super-simple VM implementation, which might not reduce that overhead, but shows this does not force undue complication on browsers, and (2) a clear optimization path to eliminating that overhead entirely. This proposal introduces two new AST constructs, best explained with an example:

while (1) { if (check()) { tilt -> L1; } else if (other()) { tilt -> L2; } else { work(); } } multiple { L1: { l_1(); } L2: { l_2(); } } next(); What happens here is if check(), then we execute l_1(), and if other() then we execute l_2(); otherwise we do some work() and go to the top of the loop. Conceptually, Tilt doesn't actually affect control flow - it just "goes with the flow" of where things are moving anyhow, but it can "tilt" things a little one way or another (like tilting a pinball machine) at specific locations. Specifically, if control flow reaches a multiple, then the tilt has an effect in picking out where we end up inside the multiple. But otherwise, Tilt doesn't cause control flow by itself.

In more detail, the semantics are fully defined as follows:

Define label as a "hidden" local variable on the stack, only visible to the next 2 bullet points. tilt -> X sets label to a numerical id that represents the label X. multiple { X1: { .. } X2: { .. } .. } checks the value of label. If it is equal to the numerical id of one of the labels X1, X2, ... then we execute the code in that label's block, set label to a null value (that is not the id of any label), and exit the multiple. Otherwise (if it is not equal to any of the ids), we just skip the multiple. Note that Tilt does not move control flow to anywhere it couldn't actually get anyhow. Control flow still moves around using the same structured loops and ifs and breaks that we already have. Tilt only does a minor adjustment to control flow when we reach a multiple. The multiple can be seen as a "switch on labels", and the tilt picks which one we enter. But again, we could have reached any of those labels in the multiple anyhow through structured control flow (and picked which label in the multiple using a switch on the numerical id of the label).

The semantics described above also provide the "super-simple" implementation mentioned in the goals. It is trivial for a VM to implement that - just define the helper variable, set it and test it - and it would be correct; and control flow is still structured. But, it is also possible in a straightforward way to just emit a branch from the tilt to the right place. In the example above, that is perhaps too easy, so consider this more interesting example:

L1: while (1) { if (checkA()) { tilt -> L2; } else { if (checkB()) break; while (1) { work(); if (checkC()) { tilt -> L3; break L1; } } never(); } } multiple { L2: { print('checkA passed!') } L3: { print('checkC passed!') } } It is straightforward for the VM to see that tilt -> L2 will reach L2, and tilt -> L3 will reach L3 - note how we need a break after the tilt to achieve that - so it can just emit branches directly. The helper variable overhead can be eliminated entirely.

This idea is modeled on the Relooper https://github.com/kripken/emscripten/raw/master/docs/paper.pdf algorithm from 2011 http://dl.acm.org/citation.cfm?id=2048224. There is a proof there that any control flow can be represented in a structured way, using just the available control flow constructs in JavaScript, and using a helper variable like label mentioned in the Tilt semantics, without any code duplication (other approaches split nodes, and have bad worst-case code size situations). The relooper has also been implemented in Emscripten, and over the last 4 years we have gotten a lot of practical experience with it, showing that it gives good results in practice, typically with little usage of the helper variable.

Thus, we have both a solid theoretical result that shows we can represent any control flow - even irreducible - in this way, and experience showing that that helper variable overhead tends to be minimal. In the proposal here, that helper variable is, in essence, written in an explicit manner, which allows even that overhead to be optimized out nicely by the VM.

A note on irreducible control flow: As mentioned, Tilt can represent it, e.g.,

tilt -> L2; while (1) { multiple { L1: { duffs(); if (done()) break; } L2: { device(); } } } This is not surprising as the relooper proof guarantees that it can. And we can also make that irredicuble control flow run at full speed (if the VM optimizes Tilt, instead of doing just the super-simple semantics as its implementation). Still, this is not quite as good as if we directly represented that irreducible control flow as basic blocks plus branches directly, since with Tilt we do need to analyze a little to find that underlying control flow graph. So something like proper tail calls, which can directly represent that graph, may still be useful, but it is debatable - at least a large set of cases of proper tail calls seem to be handled by Tilt (the cases of having heavily irreducible control flow), and without the limitations of proper tail calls like number of parameters. However, there is obviously a lot to debate on that.

In any case, putting aside irreducible control flow and proper tail calls, what Tilt is clearly great for is to eliminate any small amounts of helper variable usage, that occur often in practice, stemming from either small amounts of irreducible control flow (either a goto in the source, or a compiler optimization pass that complexified control flow for some reason), or reducible but complex enough control flow that the compiler didn't manage to remove all helper variable usages (it is an open question whether 100% can be removed in tractable time). Having Tilt in wasm would open the possibility for straightforward and predictable removal of that overhead.

To summarize this proposal, it can be seen as adding an "escape valve" to structured control flow, where code remains still structured, but we can express more complex patterns in a way that is at least easily optimizable without magical VM tricks.

That concludes the core of this proposal. I see two possible large open questions here:

The above cannot optimize indirect branches. Some interpreter loops really want that. I'm not sure if we want to get into that now. But it seems that a clear path is available - an "indirect Tilt" would use a variable instead of a label, together with a way to get the numeric id of a label. In the proposal above, we define the semantics in a super-simple way. I like the simplicity and how it also shows how to implement the feature in a trivial way, so this should not slow down development of wasm VMs. But a downside to that is that it allows one to write nonsense like tilt -> L1; work(); L2: while (1) { more(); tilt -> L2; if (check()) break; } multiple { L3: { never(); } } None of the tilts do anything; the multiple is ignored; just reading that makes me nauseous. But it is valid code to write, even if silly. It might be nice to get a parse-time error on such things, and it isn't hard. But to do so requires that we define what is errored on very precisely, which is a lot more detailed than the very simple (but fully defined!) semantics described above.

And that brings me to a final point here. We could in principle delay discussion of this to a later date. However, if we do want to add this later, and do want to error on such nonsense as the last example, then how easy and clean it is to define the semantics will depend on decisions that we do make now. In particular, the fewer control flow constructs we have, the easier and cleaner it will be; every construct may need to be involved in the description. Also, we might want to start out with simple constructs that make that easier (I have some ideas for those, but nothing too concrete yet).

In other words, what I am getting at is that if we design our control flow constructs as a whole, together with Tilt, then things will be much nicer than if we try to tack Tilt on later to something designed before it. For that reason, we might want to discuss it now.

Bikeshedding: Suggestions for better names are welcome. Another name I considered was slide -> X, as in "I totally meant to slide there" https://i.chzbgr.com/maxW500/8165877248/h5E9BC18C/.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33.

kripken commented 9 years ago

I think that in a way we can summarize Tilt syntax as explicitly writing out where jump threading can be performed. It's very straightforward to optimize when written that way, but does not provide more power than a sufficiently smart jump threading could have achieved.

pizlonator commented 9 years ago

The jump threading rule that I described is easy to implement - in fact it’s one of the easiest. Note that it requires that block B in the A->B->C chain is only doing a branch/switch on something that is constant in A, and this can be deduced just from looking at the inputs to the Phi for the variable being switched on. If B does more than just that branch/switch, the threading would bail on the grounds that increasing code size isn’t worth it. Though, usually, compilers allow for B to do up to N other things, for some carefully chosen and usually small value of N.

So, I would object to adding Tilt syntax, if it can’t be demonstrated that Tilt profitably enables more jump threading than would be done by that rule. That is: one could argue that Tilt is a hint to do more jump threading than you’d already have done, but the counterargument is that the compiler will jump-thread when it knows it is profitable based on some simple code-size-versus-cost-of-branch rules - and so it would likely ignore the Tilt hint anyway.

Can you show a case where a compiler could do something with Tilt information that it could not have done if the exact same compiler rule was simply generalized to handle any case of a switch on an integer variable?

Note that I’m assuming that any webasm implementation will have to be sophisticated enough to support constant folding, CSE, LICM, basic CFG simplifications, register allocation, and instruction selection. If you have those things - which are all probably necessary just to clean up redundancies that are introduced during translation - then the jump threading rule I describe is trivial to add.

-Filip

On May 5, 2015, at 3:47 PM, Alon Zakai notifications@github.com wrote:

I think that in a way we can summarize Tilt syntax as explicitly writing out where jump threading can be performed. It's very straightforward to optimize when written that way, but does not provide more power than a sufficiently smart jump threading could have achieved.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99253379.

lukewagner commented 9 years ago

What's concerning about the jump-threading optimization is that it pushes us back more into the JS domain of heuristic optimization and away from one of the high-level goals of predictable near-native perf. Specifically:

We'd probably want to all agree on exactly what this reliably-jump-threaded pattern was.
LLVM would have to keep this pattern in mind to make sure it didn't accidentally break it in an optimization phase by, e.g., stuffing stuff into the 'B' block.
Non-LLVM tools would have to either discover the pattern on their own or we'd have to publish the pattern... so we already have asm.wasm :P
Baseline compilers w/o all those compiler optimizations would miss out and run slower than if there was a more direct control flow construct.

That being said, I think the same concerns apply, though to a lesser degree, to this tilt feature. I think the metaphor and comparison with jump-threading is valid.

When Alon brought up this idea in #27, I asked if we couldn't define a related, but simpler control-flow primitive. Paraphrasing it here: for each of the loop statements and Block we define a -WithExits version that contains a list of non-default exit statements. E.g.:

DoWhileWithExits(body, condition, [exit0, exit1, exit2])

To jump to one of these non-default exits, there'd be a Break variant BreakTo(#label, #exit) where #exit is the index of the desired exit in the -WithExits node identified by #label. Compiling this requires zero analysis; a baseline could generate the optimal control flow, and so it addresses my concerns above. It's also, I just realized, equivalent to a function-local, nullary version of exception handling where the exit paths aren't exceptional.

The response in #27 was: "I think WhileWithExits is not quite enough (need to be able to enter loops as well)", which I see is related to the Duff's device example above. However, Duff's device isn't reducible (multiple entries into loop) so I'm fine that we can't optimally represent that control flow pattern. (The knowing programmer should have peeled off the switch of Duff's device anyway ;) Are there any other reducible patterns that a -WithExits primitive couldn't cover and we would expect to be optimized (w/o use of implicit helper var) by the tilt feature?

An argument to all of this is that we shouldn't bother if what we have now is Good Enough and we have an immediately-post-v.1 goal to support general, efficient irreducible control flow. I can buy this argument and I'd be fine leaving things as is for v.1, but, fwiw, I think implementing the *WithExits ops would be super-simple -- just a minor extension of what we do for current break/continue. Some specific data showing prevalence of this pattern would help prioritize.

kripken commented 9 years ago

@pizlonator: I'm not sure either way if the simple loop threading you describe is sufficient. Would it be able to remove all appearances of label in the following example (both assigns, tests, and resets to 0)? If so, then it can probably handle the general case.

label = 0;
while (1) {
  if (check()) {
    label = 1;
  } else if (other()) {
    label = 2;
  } else {
    work();
  }
  switch (label) {
    case 1: {
      label = 0;
      more();
      break;
    }
    case 2: {
      label = 0;
      less();
      break;
    }
  }
}

@lukewagner: WithExits can't handle loops that occur right after the loop, which sometimes happens,

  ...
}
forever {
  multiple {
    ..multiple labels that reach each other..
  }
}

This would generally be due to irreducible control flow, so perhaps you don't care about it, although I see the relooper currently emits it in other cases as well (but those could be flipped so the loop is inside the multiple; I didn't check why the current heuristics prefer the current form, might be a code size thing in JS).

But I have a difference in opinion about irreducible control flow. I think if we have an easy way to make it predictably fast, then we should, and Tilt offers that. It's better than tail calls for that in some ways. Basically, with Tilt, we would know that an arbitrary CFG from LLVM can run predictably fast - I think that's a good thing!

I'd also be ok to delay this discussion til later. Looks like we agreed on the core 2 loop types, which would make defining Tilt or something else easier (the forever loop is important there).

lukewagner commented 9 years ago

I don't see how tilt offers any advantages over signature-restricted PTCs; they by definition compile down to jumps whereas tilt (necessarily) doesn't.

kripken commented 9 years ago

Tilt does compile to jumps. There are two ways to implement them, as mentioned above: the trivial way using a helper variable, and the expected way that does a small analysis which lets it emit jumps.

lukewagner commented 9 years ago

IIUC, even when the analysis is present, it depends on what happens in between the tilt statement and the multiple statement. For example, if there existed a path that executed another tilt statement or some effectful non-tilt statement, that would be require a helper variable. I guess your point would be "but the code generator wouldn't do that", which makes me rather uncomfortable since we're back in pattern territory.

pizlonator commented 9 years ago

On May 5, 2015, at 8:08 PM, Alon Zakai notifications@github.com wrote:

@pizlonator https://github.com/pizlonator: I'm not sure either way if the simple loop threading you describe is sufficient. Would it be able to remove all appearances of label in the following example (both assigns, tests, and resets to 0)? If so, then it can probably handle the general case.

label = 0; while (1) { if (check()) { label = 1; } else if (other()) { label = 2; } else { work(); } switch (label) { case 1: { label = 0; more(); break; } case 2: { label = 0; less(); break; } } } That’s a fun thought experiment! I did it below and I believe that the answer is “yes”.

It requires a bit more than just my jump threading rule - it also requires simple reasoning about the constant folding of Phi functions, using a special case of an existing SSA conversion algorithm. I believe that any webasm backend will have this, as well. Also, if you write a backend that doesn’t use SSA, you could achieve the same effect using SCCP.

But, can you also work it out the way you anticipate an implementation using tilt? I’d like to see what you have in mind. Because, I believe that to use tilt to restore the original CFG you would also have to have the same machinery as what I’m proposing, except it would be specific to tilt rather than generalizing over integers.

I’m going to skip over one step: the “breaks” in the switch statement will be jump threaded to jump to the loop header. This is a strictly simpler jump threading than what I’m proposing and I assume that jump threading runs to fixpoint. So, here’s the CFG with SSA after that simple change, so we can focus on the much harder jump threading that actually has to consider Phi’s over constants:

Lstart: label_0 = 0 jump Lloop

Lloop: // predecessors: Lstart, LswitchCase1, LswitchCase2, Lswitch label_phi_0 = Phi(label_0, label_3, label_4, label_phi_1) if check() then LsetLabel1 else LdontSetLabel1

LsetLabel1: label_1 = 1 jump Lswitch

LdontSetLabel1: if (other()) then LsetLabel2 else Lwork

LsetLabel2: label_2 = 2 jump Lswitch

Lwork: work() jump Lswitch

Lswitch: // predecessors: LsetLabel1, LsetLabel2, Lwork label_phi_1 = Phi(label_1, label_2, label_phi_0) switch (label_phi_1) { case 1 => LswitchCase1, case 2 => LswitchCase2, default => Lloop }

LswitchCase1: label_3 = 0 more(); jump Lloop

LswitchCase2: label_4 = 0 less(); jump Lloop

Now running jump threading for one iteration will yield the following simplifications:

LsetLabel1 jumps directly to LswitchCase1
LsetLabel2 jumps directly to LswitchCase2

This results in the following code:

Lstart: label_0 = 0 jump Lloop

Lloop: // predecessors: Lstart, LswitchCase1, LswitchCase2, Lswitch label_phi_0 = Phi(label_0, label_3, label_4, label_phi_1) if check() then LsetLabel1 else LdontSetLabel1

LsetLabel1: label_1 = 1 jump LswitchCase1 // this got turned into a jump

LdontSetLabel1: if (other()) then LsetLabel2 else Lwork

LsetLabel2: label_2 = 2 jump LswitchCase2 // this got turned into a jump

Lwork: work() jump Lswitch

Lswitch: // predecessors: Lwork label_phi_1 = Phi(label_phi_0) switch (label_phi_1) { case 1 => LswitchCase1, case 2 => LswitchCase2, default => Lloop }

LswitchCase1: label_3 = 0 more(); jump Lloop

LswitchCase2: label_4 = 0 less(); jump Lloop

Now observe that:

label_phi_1 is redundant. All uses get replaced with label_phi_0
label_phi_0 has label_0, label_3, label_4, and label_phi_0 as inputs. label_0, label_3, and label_4 are all the constant zero. Therefore, label_phi_0 can be constant-folded to zero. More generally: if you search the graph starting at a Phi and all you see are Phis and a single constant, then the Phi has that constant value. This is a special case of rule 2 in section 3.1 of John Aycock's and Nigel Horspool’s “Simple Generation of Static Single-Assignment Form” paper (http://bit.ly/1JpCQrS http://bit.ly/1JpCQrS).

This means that we can remove the switch statement, because its operand is the constant 0:

Lstart: label_0 = 0 jump Lloop

Lloop: // predecessors: Lstart, LswitchCase1, LswitchCase2, Lswitch label_phi_0 = Phi(label_0, label_3, label_4, label_phi_0) if check() then LsetLabel1 else LdontSetLabel1

LsetLabel1: label_1 = 1 jump LswitchCase1

LdontSetLabel1: if (other()) then LsetLabel2 else Lwork

LsetLabel2: label_2 = 2 jump LswitchCase2

Lwork: work() jump Lswitch

Lswitch: // predecessors: Lwork jump Lloop // this switch got replaced with a jump to the default case, i.e. Lloop

LswitchCase1: label_3 = 0 more(); jump Lloop

LswitchCase2: label_4 = 0 less(); jump Lloop

Now you can see that none of the label variables are used, and DCE can eradicate them. The jump from Lwork to Lswitch can then be jump-threaded to be a jump directly to Lloop. I believe that this constitutes the optimal CFG you were looking for?

@lukewagner https://github.com/lukewagner: WithExits can't handle loops that occur right after the loop, which sometimes happens,

... } forever { multiple { ..multiple labels that reach each other.. } } This would generally be due to irreducible control flow, so perhaps you don't care about it, although I see the relooper currently emits it in other cases as well (but those could be flipped so the loop is inside the multiple; I didn't check why the current heuristics prefer the current form, might be a code size thing in JS).

But I have a difference in opinion about irreducible control flow. I think if we have an easy way to make it predictably fast, then we should, and Tilt offers that. It's better than tail calls for that in some ways. Basically, with Tilt, we would know that an arbitrary CFG from LLVM can run predictably fast - I think that's a good thing!

I'd also be ok to delay this discussion til later. Looks like we agreed on the core 2 loop types, which would make defining Tilt or something else easier (the forever loop is important there).

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99303373.

Begin forwarded message:

Date: May 5, 2015 at 8:13:22 PM PDT From: Luke Wagner notifications@github.com Reply-To: WebAssembly/spec reply@reply.github.com To: WebAssembly/spec spec@noreply.github.com Cc: pizlonator pizlo@mac.com Subject: Re: [spec] Tilt: a proposal for fast control flow (#33)

I don't see how tilt offers any advantages over signature-restricted PTCs; they by definition compile down to jumps whereas tilt (necessarily) doesn't.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99304390.

pizlonator commented 9 years ago

I agree that we don’t need this at all if what we have already is good enough. Still, I’d like to get to get to the bottom of what you guys are proposing. My objections are really a sincere desire to understand what you’re thinking.

Are you suggesting that Tilt should force the implementation to jump-thread no matter what, or are you suggesting something else?

If you force jump threading no matter what, then the Tilt feature is sort of a bizarre denial-of-service bomb, no? So, probably an implementation will still have a heuristic for when to clone code. It will have to detect explosion because, at a minimum, it would have to do this to preserve its own integrity.

But an implementation that is interested in being competitive will probably do the smarter thing: it would completely ignore the tilt hint if the code bloat is above some very small threshold, because in practice the cost of branching on a small integer value in a register is so tiny. The threshold at which tilt is profitable is likely to be identical to the threshold at which ordinary jump threading is profitable. Ergo, tilt is superfluous.

But maybe I have misunderstood! Maybe there is something about tilt that enables the compiler to restore the CFG back to its original, pre-relooping state, without code duplication. If that’s the case, then great! But I just don’t see it.

Specific comments inline:

On May 5, 2015, at 7:46 PM, Luke Wagner notifications@github.com wrote:

What's concerning about the jump-threading optimization is that it pushes us back more into the JS domain of heuristic optimization and away from one of the high-level goals of predictable near-native perf. Specifically:

We'd probably want to all agree on exactly what this reliably-jump-threaded pattern was. Maybe it’s too much of an implementation detail? LLVM’s got a lot of different jump threading pattern detectors and they have many different heuristics that are the subject of lively debate. In my last JVM bytecode JIT, I had a jump threading heuristic that I would retune annually or so because other changes in the compiler would affect the ratio of code bloat cost to branch cost.

But this should be reliably threadable: Block A sets some variable V to a constant K and jumps to B. Block B does nothing but a switch on V. The constant K that A uses is also one of the case values in the switch in B. That case value jumps to block C. Therefore, the replace the A->B jump with a A->C jump.

My preference would be to mention none of this in the webasm spec, except maybe non-normative text that warns of the possibility of “label” variables and the potential benefits of jump threading.

LLVM would have to keep this pattern in mind to make sure it didn't accidentally break it in an optimization phase by, e.g., stuffing stuff into the 'B' block. Does LLVM do things after the relooper runs? Or is the relooper in the backend? If LLVM can do things after the relooper runs, then couldn’t it also mess up the CFG so that it’s no longer wasm-compliant?

I guess I was assuming that the relooper ran after the dust had settled.

Non-LLVM tools would have to either discover the pattern on their own or we'd have to publish the pattern... so we already have asm.wasm :P Non-LLVM tools will soundly understand the code without knowing the pattern. Understanding the pattern is an optimization.

Doing jump threading is something I’ve done in static analysis in the past, precisely because the pattern that tilt covers is a pattern that occurs in user code. Classic example:

bool found = false; for (i = 0; i < stuff; ++i) { if (array[i] == thing) { found = true; break; } } if (found) blah

Baseline compilers w/o all those compiler optimizations would miss out and run slower than if there was a more direct control flow construct. It probably won’t matter because it won’t be enough of a slow-down, since the relooper won’t be using the “label” variable for everything. I thought we already established that the use of the “label” variable is uncommon.

Also, do you want baseline compilers to do jump threading? I wouldn’t. It appears that the Tilt proposal hints at this: you are right to treat Tilt as a “label” variable.

That being said, I think the same concerns apply, though to a lesser degree, to this tilt feature. I think the metaphor and comparison with jump-threading is valid.

Right.

When Alon brought up this idea in #27 https://github.com/WebAssembly/spec/pull/27, I asked if we couldn't define a related, but simpler control-flow primitive. Paraphrasing it here: for each of the loop statements and Block we define a -WithExits version that contains a list of non-default exit statements. E.g.:

DoWhileWithExits(body, condition, [exit0, exit1, exit2]) To jump to one of these non-default exits, there'd be a Break variant BreakTo(#label, #exit) where #exit is the index of the desired exit in the -WithExits node identified by #label. Compiling this requires zero analysis; a baseline could generate the optimal control flow, and so it addresses my concerns above. It's also, I just realized, equivalent to a function-local, nullary version of exception handling where the exit paths aren't exceptional.

This could be nice - but I suspect it’s superfluous if an implementation just does jump threading!

The response in #27 https://github.com/WebAssembly/spec/pull/27 was: "I think WhileWithExits is not quite enough (need to be able to enter loops as well)", which I see is related to the Duff's device example above. However, Duff's device isn't reducible (multiple entries into loop) so I'm fine that we can't optimally represent that control flow pattern. (The knowing programmer should have peeled off the switch of Duff's device anyway ;) Are there any other reducible patterns that a -WithExits primitive couldn't cover and we would expect to be optimized (w/o use of implicit helper var) by the tilt feature?

An argument to all of this is that we shouldn't bother if what we have now is Good Enough and we have an immediately-post-v.1 goal https://github.com/WebAssembly/spec/blob/master/EssentialPostV1Features.md#signature-restricted-proper-tail-calls to support general, efficient irreducible control flow. I can buy this argument and I'd be fine leaving things as is for v.1, but, fwiw, I think implementing the *WithExits' ops would be super-simple -- just a minor extension of what we do for currentbreak/continue`. Some specific data showing prevalence of this pattern would help prioritize.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99299631.

lukewagner commented 9 years ago

@pizlonator Indeed, I would not want the baseline to do jump threading, this point was mostly just leading up to the point that the -WithLoop primitive is imminently compilable w/ minimal analysis.

kripken commented 9 years ago

@lukewagner: Yeah, that's the point at the end of the proposal: to avoid yucky stuff like that, we'd need to define the semantics precisely, and make the yucky stuff a syntax error. Possible, and not that hard, especially if we've agreed on the 2 basic loop types already. In Tilt done that way, tilt -> X becomes a direct jump to X. Here in fact is a sketch of the full semantics:

When you see tilt -> X, start following "natural control flow", which is defined as follows:
- If you reach a break/continue, follow it (i.e. do the natural thing).
- If you reach the end of a forever loop, go to the start (i.e. the natural thing).
- If you reach a forever loop, enter it (ditto).
- If you reach the end of a block (e.g. in an if body) then go to the statement after the parent (ditto).
- If you reach a multiple:
- If it has X, then it worked! Emit a jump to X and you're done.
- If not, skip the multiple (i.e. the natural thing)
- Continue to follow natural control flow, applying the above rules iteratively, until it completes successfully ("it worked!"), or until you hit anything else (a statement, the end of a do-while, an if, etc.); if that happens, emit a syntax error, this was yucky.

Anything not causing a syntax error is jump-able; anything giving a syntax error should not have worked. (If I didn't miss anything... ;)

@pizlonator: Nice! Cool that LLVM can do it. It does hit the issues @lukewagner was raising though, of how much smarts we want to assume on the client VM. We'd lose predictability by relying on that specific chain of optimizations, which worries me.

Regarding whether a Tilt implementation could do the same, then I believe yes. Look at the more detailed semantics I wrote above for @lukewagner in this comment. That should work (I hope). Also, note how there is never a need for code duplication. Literally you can emit a jump when you see a Tilt (that didn't cause a syntax error), no heuristics, no code blowup, no DoS.

pizlonator commented 9 years ago

Your comments convince me even more that we don’t need Tilt syntax and that we don’t need normative text describing any of this.

Your proposal is a jump threading algorithm just like mine. It’s structured differently but it will catch the same cases - the only difference is that yours only looks for “tilt -> X” while mine will apply to any assignment of a constant value to a variable. Your proposal needs to be in the spec because it yields syntax errors. My proposal doesn’t need to be in the spec because it’s an optimization.

I don’t see the value of having a mandatory optimization with new syntax and a new error mode, if the same can be achieved by taking that same optimization and running it as part of the optimization pipeline in the webasm backend. Doing so makes this an implementation issue and we don’t need to say anything about it other than maybe non-normative discussion.

Maybe the spec could be made better by having non-normative text about optimizations. We could say, for example, what a recommended optimization pipeline could look like. It might be worth saying that a webasm backend ought to do register allocation and instruction selection, and maybe some text calling out that we anticipate the need to run constant folding, CSE and basic CFG simplifications. Jump threading is a classic CFG simplification, and we could point out that it is beneficial because of the lack of general control flow constructs. But I’d be fine with excluding any such text - it is the implementor’s job to figure this stuff out.

More comments inline:

On May 5, 2015, at 8:50 PM, Alon Zakai notifications@github.com wrote:

@lukewagner https://github.com/lukewagner: Yeah, that's the point at the end of the proposal: to avoid yucky stuff like that, we'd need to define the semantics precisely, and make the yucky stuff a syntax error. Possible, and not that hard, especially if we've agreed on the 2 basic loop types already. In Tilt done that way, tilt -> X becomes a direct jump to X. Here in fact is a sketch of the full semantics:

When you see tilt -> X, start following "natural control flow", which is defined as follows: If you reach a break/continue, follow it (i.e. do the natural thing). If you reach the end of a forever loop, go to the start (i.e. the natural thing). If you reach a forever loop, enter it (ditto). If you reach the end of a block (e.g. in an if body) then go to the statement after the parent (ditto). If you reach a multiple: If it has X, then it worked! Emit a jump to X and you're done. If not, skip the multiple (i.e. the natural thing) Continue to follow natural control flow, applying the above rules iteratively, until it completes successfully ("it worked!"), or until you hit anything else (a statement, the end of a do-while, an if, etc.); if that happens, emit a syntax error, this was yucky. Anything not causing a syntax error is jump-able; anything giving a syntax error should not have worked. (If I didn't miss anything... ;)

You’re describing a jump threading algorithm that could be made to work for any assignment to of a constant to a variable. Just replace “tilt -> X” with “variable = X where X is constant”, and remove the part about the syntax error.

The only difference to my jump threading rule is that mine speaks of blocks and Phis rather than assignments to variables (I say that “tilt -> X” is just an assignment to a variable). But I believe they will catch the same cases.

@pizlonator https://github.com/pizlonator: Nice! Cool that LLVM can do it.

I actually was not claiming that LLVM can do it; I hadn’t looked at LLVM at the time that I wrote that. I’m claiming that it can be done using the jump threading rule I offered, plus the Phi simplification rule from Aycock & Horspool. I must emphasize that these are super simple rules, no more complicated that what you’re proposing. They are no more complicated to implement and they are no more computationally intensive. I suspect that any decent compiler will have jump threading. Many compilers use the Aycock & Horspool Phi simplification algorithm either as their primary way of converting to SSA or as a clean-up phase to remove unnecessary Phis after other optimizations.

(As an aside, I was just reading the LLVM JumpThreading pass, and it appears that it can do all of this, and probably a fair bit more. It appears to be OK with sometimes cloning code, it does some opportunistic PRE, and it subsumes my jump-threading rule by a two-step process: (1) if A jumps to B, B has a Phi and switches/branches on the value of the Phi, then the jump in A is replaced by the contents of B; and (2) attempt constant-folding on each branch/switch using something like SCCP. My rule happens because if you have A jumps to B and then B switches on a value known to be constant in A, then B will be switching in a Phi so rule (1) holds and then rule (2) constant-folds the switch. As with everything in LLVM, it’s kind of nuts - running blame on it shows that it’s the subject of never-ending carnage.)

It does hit the issues @lukewagner https://github.com/lukewagner was raising though, of how much smarts we want to assume on the client VM. We'd lose predictability by relying on that specific chain of optimizations, which worries me.

Not really. No implementation would be required to implement this. Instead, it would be one of the obvious things that you’d want to implement to get performance, like register allocation and instruction selection.

Surely we aren’t going to require implementations to do those other optimizations or any optimization for that matter. Jump threading is just another optimization. We don’t have to require it, we don’t need syntax to aid it, and the most we have to do is have non-normative text suggesting it on the grounds that the webasm generator might create code patterns that benefit from it.

Regarding whether a Tilt implementation could do the same, then I believe yes. Look at the more detailed semantics I wrote above for @lukewagner https://github.com/lukewagner in this comment. That should work (I hope). Also, note how there is never a need for code duplication. Literally you can emit a jump when you see a Tilt (that didn't cause a syntax error), no heuristics, no code blowup, no DoS.

Right. Neither of us is proposing a rule that mandates code duplication. Both of our jump threading rules will catch the same patterns, if you accept that “tilt -> X” is just an assignment of a variable and a constant.

-Filip

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99312986.

kripken commented 9 years ago

(Oh, sorry, I thought that was LLVM you were mentioning there.)

Fair points. Overall I think this all comes down to different opinions on how predictable perf should be on the client. If we want all control flow to be predictably fast, I think we need this in the spec; if we prefer to leave optimizations to VMs, then we don't.

Regarding that both options (Tilt and a good optimizing VM) do jump threading, the difference as I see it is that jump threading in the semantics I wrote would be on the AST - something the user emits and controls - while the optimization passes you mention would run on a compiler-internal data structure - something the user has far less control over (and perhaps no understanding of). That ties in again to the issue of predictability being possible with Tilt, but uncertain if we leave it to VMs to optimize on their own.

But also, on the AST we can easily see what can be turned into jumps; it is less clear that that would be the case after it is transformed to the internal compiler IR, which is "messier", lower-level, and might reorder things a little, e.g., some code might show up between (what used to be) the Tilt and the multiple. So I am not sure that the VM's jump threading would always work where Tilt would, but I do see your point that it is likely to succeed.

lukewagner commented 9 years ago

This discussion makes me question again what it is that we're trying to optimize here with our design choice of structured control flow primitives over 'goto'. My current answers are:

Structured control flow admits a compact yet simple binary encoding, presumably smaller than simple basic-block + goto control flow.
Structured control flow provides efficient (translation and generated code) asm.js polyfill.
~~Structured control flow is simpler to compile.~~

1 and 2 could be preserved by keeping the structured control flow primitives we have now and relooping as a standard compiler pass. As for 3, if we are generating irreducible control flow in the backend anyway (via jump-threading/tilt optimizations), I'm not sure what's really simpler here.

OTOH

I think goto wins out slightly in terms of ergonomics and developer expectations over while+switch.
Both tilt and jump-threading require some extra work to optimize properly which goto doesn't. Yes, the JS engines in all browsers all have beefy backends that can do this, but I like the general design goal that wasm can be compiled to pretty-efficient code by a really simple single-pass compiler (not just for browser baseline, but for new non-browser uses). True, most control flow can be relooped, so goto vs. while+switch matters less, but sometimes unstructured control flow does find its way onto the hot path (e.g., the zany interpreter loops we've come across).

pizlonator commented 9 years ago

On May 5, 2015, at 10:14 PM, Alon Zakai notifications@github.com wrote:

(Oh, sorry, I thought that was LLVM you were mentioning there.)

Fair points. Overall I think this all comes down to different opinions on how predictable perf should be on the client. If we want all control flow to be predictably fast, I think we need this in the spec; if we prefer to leave optimizations to VMs, then we don't.

I think that having an occasional branch on a value available in a local variable is sufficiently low cost that we don’t need it in the spec, especially considering that the vast majority of these cases - likely all of them - will be caught by jump threading.

Regarding that both options (Tilt and a good optimizing VM) do jump threading, the difference as I see it is that jump threading in the semantics I wrote would be on the AST - something the user emits and controls - while the optimization passes you mention would run on a compiler-internal data structure - something the user has far less control over (and perhaps no understanding of). That ties in again to the issue of predictability being possible with Tilt, but uncertain if we leave it to VMs to optimize on their own.

But also, on the AST we can easily see what can be turned into jumps; it is less clear that that would be the case after it is transformed to the internal compiler IR, which is "messier", lower-level, and might reorder things a little, e.g., some code might show up between (what used to be) the Tilt and the multiple. So I am not sure that the VM's jump threading would always work where Tilt would, but I do see your point that it is likely to succeed.

I just don’t buy this. When it comes to constants in local variables and branches, you see strictly more information when you go to a lower-level IR.

-Filip

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99324306.

pizlonator commented 9 years ago

On May 5, 2015, at 10:20 PM, Luke Wagner notifications@github.com wrote:

This discussion makes me question again what it is that we're trying to optimize here with our design choice of structured control flow primitives over 'goto’.

This is a good discussion to have. :-) My current answers are:

Structured control flow admits a compact yet simple binary encoding, presumably smaller than simple basic-block + goto control flow. Do we know how big this is? Structured control flow provides efficient (translation and generated code) asm.js polyfill. This is a non-trivial benefit in the short term. Structured control flow is simpler to compile. I agree that this benefit is not real.

Possible additional benefit:

Structured control flow has smaller verification burden.

You don’t have to check that your goto branch offsets are within bounds. Such “security” benefits are so subjective. What do y’all think?

1 and 2 could be preserved by keeping the structured control flow primitives we have now and relooping as a standard compiler pass. As for 3, if we are generating irreducible control flow in the backend anyway (via jump-threading/tilt optimizations), I'm not sure what's really simpler here.

OTOH

I think goto wins out slightly in terms of ergonomics and developer expectations over while+switch. That’s true. Both tilt and jump-threading require some extra work to optimize properly which goto doesn't. Yes, the JS engines in all browsers all have beefy backends that can do this, but I like the general design goal that wasm can be compiled to pretty-efficient code by a really simple single-pass compiler (not just for browser baseline, but for new non-browser uses). True, most control flow can be relooped, so goto vs. while+switch matters less, but sometimes unstructured control flow does find its way onto the hot path (e.g., the zany interpreter loops we've come across). I don’t buy the single-pass thing. I thought we’ve already established that the relooper requires the “label” variable infrequently enough that it’s not a showstopper. Any backend that does good regalloc/isel will have multiple passes, and will have something like CFG simplifications at some point. Often more than once.

I think that wasm with structured control flow and an occasional “label” variable due to weird CFGs will still qualify as something that you could compile somewhat efficiently with a single-pass compiler. The output of this compiler won’t be as good as the output of a beefy backend, but that’s not because of the lack of goto or jump threading - it’ll be because of all of the other things you lose when your backend is restricted to a single pass.

-Filip

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99325964.

lukewagner commented 9 years ago

Do we know how big this is?

In the polyfill prototype, most statement nodes are encoded as a single byte (if, loops, break, continue, ret); only the labeled breaks/continues/switch/block have a variable-length int immediate (# enclosing blocks to break out of / # cases / # stmts). When I imagine a compact encoding of basic/blocks goto, all of these single-byte statement nodes would require additional var-int immediates (index of the target block) for what are currently child statement nodes. Also, for the labeled-break/continue ops, the immediates would likely be on average larger (many basic blocks vs usually-small block nesting depth) and thus take more bytes for the variable-length int. In the prototype polyfill, statement nodes constitute 12% of the bytes. I would estimate the above goto scheme (in the absence of structured control flow nodes) would require at least twice the bytes (though I can take the time to do a real experiment later). Even with compression, there should still be a significant difference, though not enormous. When you combine this long-term win with the short-term big win for the polyfill, I think the argument is pretty compelling, though, for including these structured primitives.

Structured control flow has smaller verification burden.

You don’t have to check that your goto branch offsets are within bounds. Such “security” benefits are so subjective. What do y’all think?

My first inclination is that a function body would be described as an array of basic blocks (much like the code section of a module is an array of functions) and thus a goto would use the index of the block, not a raw offset into the byte stream. Thus, the case of bounds checking goto targets would be analogous to all the other cases where indices are used (function, local var, global).

I thought we’ve already established that the relooper requires the “label” variable infrequently enough that it’s not a showstopper.

Yes, we have established that it does well enough on average, taking "average" to mean that on an average program, performance will not be affected by these bad cases (w/o any jump-threading opts). However, the occasional program will be tremendously affected. The first example that comes to mind is the Guile Scheme interpreter which Marc Feeley tries to Emscripten-compile from time to time and does much worse than native. A second bad example is sqlite which ends up with a lot of label usage which hurts general codegen. In general, though, there is a class of programs that spend the majority of their time in one big spaghetti function. This doesn't mean we have to care about them for v.1, but if we're considering longer-term ideals, I think we should. I also don't think we should commit any warts to wasm solely for the "simple compilers on spaghetti codes" use case; but I do think this is a mild use case in the "pro" column of goto.

titzer commented 9 years ago

I don't see how Tilt is composable, since it relies on a single hidden local variable.

E.g.

L1: while (1) { if (checkA()) { tilt -> L2; } else { if (checkB()) break; while (1) { work(); if (checkC()) { tilt -> L3; break L1; } } never(); } } QuestionMark multiple { L2: { print('checkA passed!') } L3: { print('checkC passed!') } }

In this case, if QuestionMark contains a tilt (even an internal one), then it destroys the state variable. Or are you saying that QuestionMark must be empty or cannot contain a tilt? I think the the only way to make this make sense it to make the hidden state variable well-scoped; i.e. it must have a scope which introduces it.

On Wed, May 6, 2015 at 12:30 AM, Alon Zakai notifications@github.com wrote:

The goals of this proposal are

Keep representing control flow in a structured manner, as we already agree; the structure makes it compressible, easy to optimize on the client, and is nice for polyfilling too.

Reduce currently-existing control flow overhead, which we know exists in practice in solutions like Emscripten, and may be impossible to avoid (even if it is typically small).

Do so in a way that allows both (1) a super-simple VM implementation, which might not reduce that overhead, but shows this does not force undue complication on browsers, and (2) a clear optimization path to eliminating that overhead entirely.

This proposal introduces two new AST constructs, best explained with an example:

while (1) { if (check()) { tilt -> L1; } else if (other()) { tilt -> L2; } else { work(); } } multiple { L1: { l_1(); } L2: { l_2(); } } next();

What happens here is if check(), then we execute l_1(), and if other() then we execute l_2(); otherwise we do some work() and go to the top of the loop. Conceptually, Tilt doesn't actually affect control flow - it just "goes with the flow" of where things are moving anyhow, but it can "tilt" things a little one way or another (like tilting a pinball machine) at specific locations. Specifically, if control flow reaches a multiple, then the tilt has an effect in picking out where we end up inside the multiple. But otherwise, Tilt doesn't cause control flow by itself.

In more detail, the semantics are fully defined as follows:

Define label as a "hidden" local variable on the stack, only visible to the next 2 bullet points.

tilt -> X sets label to a numerical id that represents the label X.

multiple { X1: { .. } X2: { .. } .. } checks the value of label. If it is equal to the numerical id of one of the labels X1, X2, ... then we execute the code in that label's block, set label to a null value (that is not the id of any label), and exit the multiple. Otherwise (if it is not equal to any of the ids), we just skip the multiple.

Note that Tilt does not move control flow to anywhere it couldn't actually get anyhow. Control flow still moves around using the same structured loops and ifs and breaks that we already have. Tilt only does a minor adjustment to control flow when we reach a multiple. The multiple can be seen as a "switch on labels", and the tilt picks which one we enter. But again, we could have reached any of those labels in the multiple anyhow through structured control flow (and picked which label in the multiple using a switch on the numerical id of the label).

The semantics described above also provide the "super-simple" implementation mentioned in the goals. It is trivial for a VM to implement that - just define the helper variable, set it and test it - and it would be correct; and control flow is still structured. But, it is also possible in a straightforward way to just emit a branch from the tilt to the right place. In the example above, that is perhaps too easy, so consider this more interesting example:

L1: while (1) { if (checkA()) { tilt -> L2; } else { if (checkB()) break; while (1) { work(); if (checkC()) { tilt -> L3; break L1; } } never(); } } multiple { L2: { print('checkA passed!') } L3: { print('checkC passed!') } }

It is straightforward for the VM to see that tilt -> L2 will reach L2, and tilt -> L3 will reach L3 - note how we need a break after the tilt to achieve that - so it can just emit branches directly. The helper variable overhead can be eliminated entirely.

This idea is modeled on the Relooper https://github.com/kripken/emscripten/raw/master/docs/paper.pdf algorithm from 2011 http://dl.acm.org/citation.cfm?id=2048224. There is a proof there that any control flow can be represented in a structured way, using just the available control flow constructs in JavaScript, and using a helper variable like label mentioned in the Tilt semantics, without any code duplication (other approaches split nodes, and have bad worst-case code size situations). The relooper has also been implemented in Emscripten, and over the last 4 years we have gotten a lot of practical experience with it, showing that it gives good results in practice, typically with little usage of the helper variable.

Thus, we have both a solid theoretical result that shows we can represent any control flow - even irreducible - in this way, and experience showing that that helper variable overhead tends to be minimal. In the proposal here, that helper variable is, in essence, written in an explicit manner, which allows even that overhead to be optimized out nicely by the VM.

A note on irreducible control flow: As mentioned, Tilt can represent it, e.g.,

tilt -> L2; while (1) { multiple { L1: { duffs(); if (done()) break; } L2: { device(); } } }

This is not surprising as the relooper proof guarantees that it can. And we can also make that irredicuble control flow run at full speed (if the VM optimizes Tilt, instead of doing just the super-simple semantics as its implementation). Still, this is not quite as good as if we directly represented that irreducible control flow as basic blocks plus branches directly, since with Tilt we do need to analyze a little to find that underlying control flow graph. So something like proper tail calls, which can directly represent that graph, may still be useful, but it is debatable

at least a large set of cases of proper tail calls seem to be handled by Tilt (the cases of having heavily irreducible control flow), and without the limitations of proper tail calls like number of parameters. However, there is obviously a lot to debate on that.

In any case, putting aside irreducible control flow and proper tail calls, what Tilt is clearly great for is to eliminate any small amounts of helper variable usage, that occur often in practice, stemming from either small amounts of irreducible control flow (either a goto in the source, or a compiler optimization pass that complexified control flow for some reason), or reducible but complex enough control flow that the compiler didn't manage to remove all helper variable usages (it is an open question whether 100% can be removed in tractable time). Having Tilt in wasm would open the possibility for straightforward and predictable removal of that overhead.

To summarize this proposal, it can be seen as adding an "escape valve" to structured control flow, where code remains still structured, but we can express more complex patterns in a way that is at least easily optimizable without magical VM tricks.

That concludes the core of this proposal. I see two possible large open questions here:

The above cannot optimize indirect branches. Some interpreter loops really want that. I'm not sure if we want to get into that now. But it seems that a clear path is available - an "indirect Tilt" would use a variable instead of a label, together with a way to get the numeric id of a label.

In the proposal above, we define the semantics in a super-simple way. I like the simplicity and how it also shows how to implement the feature in a trivial way, so this should not slow down development of wasm VMs. But a downside to that is that it allows one to write nonsense like

tilt -> L1; work(); L2: while (1) { more(); tilt -> L2; if (check()) break; } multiple { L3: { never(); } }

None of the tilts do anything; the multiple is ignored; just reading that makes me nauseous. But it is valid code to write, even if silly. It might be nice to get a parse-time error on such things, and it isn't hard. But to do so requires that we define what is errored on very precisely, which is a lot more detailed than the very simple (but fully defined!) semantics described above.

And that brings me to a final point here. We could in principle delay discussion of this to a later date. However, if we do want to add this later, and do want to error on such nonsense as the last example, then how easy and clean it is to define the semantics will depend on decisions that we do make now. In particular, the fewer control flow constructs we have, the easier and cleaner it will be; every construct may need to be involved in the description. Also, we might want to start out with simple constructs that make that easier (I have some ideas for those, but nothing too concrete yet).

In other words, what I am getting at is that if we design our control flow constructs as a whole, together with Tilt, then things will be much nicer than if we try to tack Tilt on later to something designed before it. For that reason, we might want to discuss it now.

Bikeshedding: Suggestions for better names are welcome. Another name I considered was slide -> X, as in "I totally meant to slide there" https://i.chzbgr.com/maxW500/8165877248/h5E9BC18C/.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33.

kripken commented 9 years ago

Yes, tilt isn't meant to be composable in that sense - if QuestionMark is basically anything, that wouldn't be a valid tilt (except for the small list of necessary exceptions in the list given above; stuff like entering a forever loop is ok). Tilts need to be super-simple to figure out, almost as simple as gotos, basically. I would say that composability is not important because tilts are just a way to represent control flow in a structured way: they would be generated as the final step of converting a program to wasm, and removed among the first steps of parsing wasm on the client, so they aren't meant to be part of an IR that you do things like inlining and other compositions.

titzer commented 9 years ago

On Wed, May 6, 2015 at 8:38 PM, Alon Zakai notifications@github.com wrote:

Yes, tilt isn't meant to be composable in that sense - if QuestionMark is basically anything, that wouldn't be a valid tilt (except for the small list of necessary exceptions in the list given above; stuff like entering a forever loop is ok). Tilts need to be super-simple to figure out, almost as simple as gotos, basically. I would say that composability is not important because tilts are just a way to represent control flow in a structured way: they would be generated as the final step of converting a program to wasm, and removed among the first steps of parsing wasm on the client, so they aren't meant to be part of an IR that you do things like inlining and other compositions.

That's somewhat nonsensical. I think we should prefer composable AST structures, and it seems like scoped state variables would essentially subsume tilt and offer a local rewriting with obvious semantics and no case-by-case analysis.

All other control structures so far have the single-entry single-exit property. That means that all internal control flow is always nested. I don't propose to do heavy lifting on the IR as an AST but it's a nice property to have; e.g. inlining a method is literally substituting in a call with its body.

-B

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99567386.

titzer commented 9 years ago

On Wed, May 6, 2015 at 9:26 PM, Ben L. Titzer titzer@google.com wrote:

On Wed, May 6, 2015 at 8:38 PM, Alon Zakai notifications@github.com wrote:

Yes, tilt isn't meant to be composable in that sense - if QuestionMark is basically anything, that wouldn't be a valid tilt (except for the small list of necessary exceptions in the list given above; stuff like entering a forever loop is ok). Tilts need to be super-simple to figure out, almost as simple as gotos, basically. I would say that composability is not important because tilts are just a way to represent control flow in a structured way: they would be generated as the final step of converting a program to wasm, and removed among the first steps of parsing wasm on the client, so they aren't meant to be part of an IR that you do things like inlining and other compositions.

That's somewhat nonsensical. I think we should prefer composable AST structures, and it seems like scoped state variables would essentially subsume tilt and offer a local rewriting with obvious semantics and no case-by-case analysis.

All other control structures so far have the single-entry single-exit property. That means that all internal control flow is always nested. I don't propose to do heavy lifting on the IR as an AST but it's a nice property to have; e.g. inlining a method is literally substituting in a call with its body.

I should note that above the single-entry single-exit sentence above is true for all AST trees that don't have free break/continue labels.

-B

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/spec/issues/33#issuecomment-99567386.

kripken commented 9 years ago

I do agree that composability is a nice property to have, and "proper" tilt - tilt with errors on yucky syntax - does harm that. You can maintain full composability, however, if you are ok with "loose" tilt - tilt where yucky stuff does nothing.

In other words, for proper tilt to literally be guaranteed to optimize into a goto (if it didn't trigger a syntax error), you have to lose composability, because proper tilt in fact is like a proof that it can turn into a goto, and the proof doesn't compose with other stuff, it has to remain in a certain very inflexible form.

MikeHolman commented 9 years ago

A bit late to the party here, but here's what I'm thinking.

A "loose" tilt is no better than having a local variable and running a jump threading algorithm. If we were to have a loose tilt, then I don't think I would bother with including it in our internal IR, since we would still need to perform all the same validation as it would for a more generic jump threading. For this reason, I would only really consider the "proper" tilt.

A "proper" tilt would be stronger in that a baseline JIT could safely do jump threading without any analysis needed. It also might save some work in your full JIT if your current backend wouldn't already catch this jump threading opportunity.

This advantage makes it slightly appealing, but the reduction to composability/increase to semantic complexity are serious drawbacks. Future AST constructs will necessarily have to consider how they impact tilt/multiple, and while I don't really foresee any terrible problems, I can imagine that this feature continues to accrue complexity. For example, maybe we will need to add semantics for how this will interact with mutexes. And maybe we won't, but I'm afraid this will end up being a feature which we constantly need to consider when adding new constructs. It also adds some extra complexity while validating, e.g. it will require loop detection to cover cases like tilt->X; forever {continue;}.

For me, the drawback of increasing complexity outweighs the advantage of potentially faster baseline JIT, and I'd prefer to stick with what we have (though I think gotos deserve their own discussion and serious consideration).

kripken commented 9 years ago

Well put, and I've come to think that way as well. Proper tilt if anything, but even that is not worth it due to the downsides. So we will not have a clear guarantee of full speed on all control flow (including irredicuble), which is a little sad, but still the best option we have.

sunfishcode commented 9 years ago

We presently have what seems to be a reasonable consensus around the current approach, so I'd like to close this issue. Of course, if people wish, this issue can always be reopened or the ideas in it reraised.