WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0
11.4k stars 694 forks source link

Eliminate Expression/Statement distinction in AST? #223

Closed titzer closed 8 years ago

qwertie commented 9 years ago

While I haven't worked this out in detail, I know from designing Enhanced C# (unfinished) and LES that languages in the Algol tradition (C/C++/Java etc) can be viewed as purely expression-based languages like LISP or Haskell.

As I was saying elsewhere, I think all executable code (if not more of the system) should be expression-based.

Merging statements and expressions into a single concept called 'expressions' seems to produce a strict superset of the current plan. I think it would be a simpler and more capable model than what is currently proposed, and it won't substantially alter code generation for the Algol branch of languages (or, if it turns out that it does, perhaps the language could be tweaked until it doesn't) The translation to asm.js could be more complicated, admittedly, but remember that this is a short-term problem--we won't use a polyfill forever.

(goto can be shoehorned into the "everything is an expression" model too, albeit clumsily.)

jfbastien commented 9 years ago

Yes. It doesn't seem to provide much at the moment.

qwertie commented 9 years ago

I'd like to be able to write code like this:

x += switch(...) {
    case 1: break foo();
    case 2: break bar();
    case 3: break baz();
}

or even:

doStuff(12345, if (...) {
        foo()
    } else {
        y = 0;
        forever {
            y += ...
            if (...) break(y);
        }
    });

(The latter, while a bit ugly, is shorter than the typical way you'd write that code, so it agrees with the design goal of "size efficiency")

Things like this are possible in LISP and Haskell (except that Haskell doesn't allow mutable variables) and I'm sure I've seen it in at least one other (otherwise traditional) imperative language.

MikeHolman commented 9 years ago

@qwertie I think this is a pretty cool idea, but I'm concerned about polyfill to js and added complexity to VMs (it seems like you would have to build up a flowgraph for type checking).

jfbastien commented 9 years ago

I agree that such syntax is nice if someone is writing wasm files manually. I don't think that's a goal of WebAssembly, though.

I nonetheless think that there could be advantages to what you suggest:

Before we discuss this further, can you please confirm that you're a member of the W3C community group? "@qwertie" isn't quite a name I can find there! The WebAssembly effort really needs to operate under the W3C's umbrella, and significant contributions require membership.

lukewagner commented 9 years ago

FWIW, @sunfishcode have also been talking about having all statements be expressions for the purpose of avoiding get_local/set_local ops (which take 30-40% of the bytes, at least in the prototype) by leveraging AST's concise representation of single-use expressions. It'd be good to motivate this by showing size savings (even in the presence of the macro/generic compression layers), but I'm in favor.

qwertie commented 9 years ago

@jfbastien Yes, I'm a member of the W3C group (real name: David Piepgrass)

IMO expression-based languages are 'the future', because even if one doesn't like them stylistically, they have other advantages. For example, languages may have higher-level constructs that are naturally lowered into "an expression-with-a-loop-inside": e.g. list comprehensions, function inlining.

kripken commented 9 years ago

@lukewagner that sounds interesting, but I don't quite see it, can you please elaborate?

jfbastien commented 9 years ago

It sounds like we agree it would be nice to have, but only if data shows that it'll be useful. Style is great, but we're not trying to optimize for that.

kg commented 9 years ago

The only reason I suggested it is that it reduces the tree depth of a typical AST somewhat, which could improve compression. I've yet to test this theory out in depth other than doing the change in one specific case and seeing my test files get considerably smaller. (from Block -> [ExpressionStatement, ...] to Block -> [Expression, ...])

MikeHolman commented 9 years ago

I have given this some more thought and I realize that I way overestimated the complexity for WebAssembly VMs, and from that perspective I'm fine with this. And I can believe that there are some modest wins in AST size (even considering macro layer), so that does give some motivation.

However, I still can't imagine a good way to translate this to js. The only thing I can think of which works in the general case is outlining the flow expressions into function expressions/calls, but that is obviously too costly. I'm sure we could hoist them into separate statements in many cases (and assign the result to a new local for value producing flow expressions), but that would require the polyfill to do dependence analysis...

Unless someone has a better idea for polyfill, I think we should leave this as a possible future feature for a time when the polyfill isn't as important.

kg commented 9 years ago

My implication here is that the distinction doesn't entirely go away, but it isn't encoded into the AST. For example, if you're in a block and find an expression, it's obviously a free-standing statement. Much like how we've been leaning towards using context to infer things about operations (we know this expression is an integral one because it's enclosed inside an int setlocal operation), we can often make robust inferences about whether something is an expression or a statement in a given context.

There are only a few cases where it really matters, though. BlockStatement -> Block and ExpressionStatement -> Expression can both be pretty profound in terms of making the AST shallower and slightly smaller to encode. Best-case, each one has a type indicator (of some kind) for the 'statement' (block, expression) and then a type indicator for the inner expression in the case of the ExpressionStatement. It's possible there are other ways to optimize out these costs, of course. This might be something that's best tackled at the encoding layer much like macros.

creationix commented 9 years ago

I'm all for making the AST simpler and smaller, this had advantages in several areas as noted.

Is the only real blocker finding an efficient transpile target for the polyfill?

I think with @kg's idea most expressions used as statements will be obvious and won't need wrapping in functions. I would love to see some concrete examples of the AST in question so I could mull it over.

MikeHolman commented 9 years ago

@kg I definitely think for non-value producing expressions (e.g. set_local, or void function calls) we can remove the need for ExpressionStatement->Expression. For example, we can let void type expressions share an opcode space with statements (and still keep the semantic differences).

My concern is with what @qwertie is talking about, where we do completely remove the distinction between them.

sunfishcode commented 9 years ago

Right now, the only way a local value can be live across a basic block boundary (ignoring control flow embedded in expressions) is to live in a local variable. Making statements be expressions would mean that some values could flow out of blocks without having to go through local variables, which would theoretically reduce the size due to get_local/set_local that @lukewagner mentioned.

An interesting twist is that we could extend this from allowing a single value to flow out of a block to allowing multiple values to flow out of a block, by allowing expressions to return multiple values (something which would have other uses besides). For sanity, we don't want to deal with tuples floating through the language in general, so we'd probably require multiple return values to be immediately unpacked and named, so there'd still be some set_local/get_local overhead, but it'd be reduced.

Either way, it would be interesting to see how much this would help compression.

One question I have though is: do we really want expressions with embedded control flow at all? The current design mentions comma and short-circuit ?: but do we even want those? Embeddable control flow makes it somewhat more complex to construct a CFG from the AST, which could complicate some baseline JIT strategies. If the compression wins aren't great, this may be worth considering.

kg commented 9 years ago

The question of whether we want embedded control flow at all is a very important one. I'll have to think about it in depth, but I personally don't like the comma operator or short-circuit ?: (mostly on instinct). I've just been assuming that we want to be able to express them in our AST. IIRC in past discussions we have used the existence of the short-circuiting ternary operator to justify omitting some other things, since asm.js leans on it heavily.

titzer commented 9 years ago

On Thu, Jun 25, 2015 at 2:45 AM, Dan Gohman notifications@github.com wrote:

Right now, the only way a local value can be live across a basic block boundary (ignoring control flow embedded in expressions) is to live in a local variable. Making statements be expressions would mean that some values could flow out of blocks without having to go through local variables, which would theoretically reduce the size due to get_local/set_local that @lukewagner https://github.com/lukewagner mentioned.

An interesting twist is that we could extend this from allowing a single value to flow out of a block to allowing multiple values to flow out of a block, by allowing expressions to return multiple values (something which would have other uses besides). For sanity, we don't want to deal with tuples floating through the language in general, so we'd probably require multiple return values to be immediately unpacked and named, so there'd still be some set_local/get_local overhead, but it'd be reduced.

Either way, it would be interesting to see how much this would help compression.

One question I have though is: do we really want expressions with embedded control flow at all? The current design mentions comma and short-circuit ?: but do we even want those? Embeddable control flow makes it somewhat more complex to construct a CFG from the AST, which could complicate some baseline JIT strategies. If the compression wins aren't great, this may be worth considering.

For reference, both ?: and comma in the v8-native-prototype decoder were an additional 10 lines of code each, approximately.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/223#issuecomment-115056256.

sunfishcode commented 9 years ago

I guess that's true, since you form the CFG while decoding. I was thinking about forming the CFG based on an in-memory AST, but that's not the common case here for baseline JITs.

rossberg commented 9 years ago

I strongly second eliminating statements. This would be particularly useful for translating higher-level languages into wasm in the future, which will often need to create control structure inside expression contexts.

Even though supporting HL languages is not an initial goal, the statement distinction is a fundamental design decision that cannot easily be reverted later.

As for the JS polyfill, there is a proposal for adding "do-expressions" to the next version of JavaScript (which I am a co-champion of). The idea is to allow "do { <statement-list> }" as an expression. Once that arrives (hopefully next year), translation to JS would become straightforward.

rossberg commented 9 years ago

For the record, Ben asked me whether eliminating statements would necessitate an explicit operator for discarding values. Replicating my reply here:

I think that's not needed. Basically, you can always tell from the syntactic context where a value is discarded (off-hand: left of a comma/semicolon, at the end of a loop body, and recursively in the arm of any branch construct whose value is discarded).

Another way to spec this easily is via the type system. Essentially, some expressions produce type void (e.g. loops), some expressions require their subexpressions to have type void (e.g. the left operand of a semicolon), and there is an implicit conversion from any type to void, whose operational interpretation is discarding the value.

This nicely generalises the situation of function calls that return void being expressions already.

wora commented 9 years ago

Eventually people will compile different languages to wasm. There is hardly much benefit to keep distinction between statements and expressions even for C/C++. Such distinction will very likely become a real disadvantage for handling other languages. Now would be the best time to eliminate the distinction entirely.

MikeHolman commented 9 years ago

I don't think the polyfill can use any new, not implemented, js features (including modules for that matter). That would remove all utility from it.

kripken commented 9 years ago

It sounds like there are a few good reasons to want to eliminate the expression/statement distinction. However, we do need a good solution for polyfilling to current JS, as @MikeHolman said, so I think it would be useful to see an experiment showing that this change would not prevent that. Modifying @lukewagner's prototype might be one way.

MI3Guy commented 9 years ago

I'm not exactly sure how relevant this is given that WebAssembly doesn't really have subtyping. One useful representation for the type of jump expressions such as break and return is a "bottom" type. (A type that is a subtype of all other types and has no instances.) This would also be useful when exceptions are implemented.

lukewagner commented 9 years ago

@kripken If we assume that the polyfill can inject temporary locals (something it already does to handle one unfortunate corner case with mixed f32/f64 typed array stores), I don't see a fundamental problem with this. The one challenge is keeping everything single-pass: if you've started emitting an expression tree and encounter a loop, it's too late. For this, the separate issue I just mentioned, and a few other twiddly details that I current cheat on in the polyfill prototype, I'm thinking we'll really end up wanting a polyfill-optimized specific layer (non-standard, so we can do whatever for short-term polyfill considerations, and then change before/if standardizing). Also, while temporary variables could in theory result in much worse code, the specific layer could be designed (in cooperation with the wasm-compiler) to effectively never need them.

kripken commented 9 years ago

Sounds plausible, but I think the real question is how much overhead that would add, compared to the much more straightforward situation we have investigated so far. That's why I'd be curious to see experiments here.

titzer commented 8 years ago

I was initially against statements == expressions due to implementation concerns, but in the intervening time have implemented blocks, ifs, loops, break, and continue in terms of expressions, finding the implementation burden to be NBD. There are nice advantages to having statements == expressions, like getting rid of return and comma and conditional, as well as allowing function bodies to be a single expression. So I support statements == expressions.

qwertie commented 8 years ago

A lot of work has been done since this issue was opened. The current spec doesn't have a statement-expr distinction, does it? Has a plan for polyfilling been established? And is it fair to say there is consensus on the issue?

lukewagner commented 8 years ago

I also support stmts=exprs. There are a couple of viable polyfill strategies. If we integrate the polyfill with the specific layer as suggested above, then the specific layer binary stream could contain a marker in front of expression trees that had problematic stmt nesting (from an asm.js pov) so that the polyfill could conservatively emit that expression tree in three address code.

rossberg commented 8 years ago

I'm also in favour (surprise!). I can see several ways of polyfilling it, starting from dumb solutions like creating auxiliary functions to ones introducing minimal numbers of auxiliary variables at the statement/expression boundary.

kripken commented 8 years ago

There are nice advantages to having statements == expressions, like getting rid of return

This would remove the need for a final return at the end of a function, but I don't see how it can avoid needing return in the middle if we want an early exit?

titzer commented 8 years ago

Early returns can be done by making the body of the function a block and using a break-with-value from that block.

On Fri, Oct 23, 2015 at 2:35 PM, Alon Zakai notifications@github.com wrote:

There are nice advantages to having statements == expressions, like getting rid of return

This would remove the need for a final return at the end of a function, but I don't see how it can avoid needing return in the middle if we want an early exit?

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/223#issuecomment-150698919.

rossberg commented 8 years ago

@kripken, and what titzer describes is how the spec desugars return already.

kripken commented 8 years ago

A break could achieve the same result, I agree. But then a concern is that the break would need to specify the nesting level. That adds some entropy that I would guess compresses more poorly. I don't see a way to do an early return that doesn't have downsides compared to return.

Overall I don't think we have clarity on whether this improves or degrades code size? I think we've seen arguments both ways: having statements == expressions lets you return a value in some cases "for free", avoiding a temp var, but for the most part statements like while etc. will not return a value anyhow, so nothing is saved. And there are some common cases like early returns where this seems detrimental. I also worry there might be cases where a value must be manually discarded, e.g.

void foo() {
  bar()
}

where bar returns a value, used in other places calling it, but that we want to ignore here.

Overall, my current understanding is that

  1. This would be nicer. Having everything be an expression is more elegant!
  2. It's not clear yet if this helps or hurts code size.
  3. There were arguments above that this would help port languages in which statements == expressions, but I'm not sure I follow that, since e.g. LLVM IR isn't like that, but pretty much everything can compile to it these days. On the other hand, having statements == expressions might help making compilers for those languages somewhat simpler, but that is a far weaker benefit.
  4. This complicates the JS polyfill, but does not make it impossible.

Seems like a mix, so far?

titzer commented 8 years ago

On Fri, Oct 23, 2015 at 5:20 PM, Alon Zakai notifications@github.com wrote:

A break could achieve the same result, I agree. But then a concern is that the break would need to specify the nesting level. That adds some entropy that I would guess compresses more poorly. I don't see a way to do an early return that doesn't have downsides compared to return.

Breaks already specify their nesting level. It's easy to add a level-0 break as a separate opcode to save a byte.

Overall I don't think we have clarity on whether this improves or degrades code size? I think we've seen arguments both ways: having statements == expressions lets you return a value in some cases "for free", avoiding a temp var, but for the most part statements like while etc. will not return a value anyhow, so nothing is saved. And there are some common cases like early returns where this seems detrimental. I also worry there might be cases where a value must be manually discarded, e.g.

We're not at the point where we're squeezing bytes, but it's easy to imagine some one-byte breaks like break0, break1, break2, when combined with an implicit block for function bodies, is equivalent.

void foo() { bar() }

where bar returns a value, used in other places calling it, but that we want to ignore here.

Void can accept and discard any value, so this is NBD.

Overall, my current understanding is that

  1. This would be nicer. Having everything be an expression is more elegant!

That's a plus.

  1. It's not clear yet if this helps or hurts code size.

That's a neutral.

  1. There were arguments above that this would help port languages in which statements == expressions, but I'm not sure I follow that, since e.g. LLVM IR isn't like that, but pretty much everything can compile to it these days. On the other hand, having statements == expressions might help making compilers for those languages somewhat simpler, but that is a far weaker benefit.

That's a plus for some languages, and a zero for LLVM. So plus.

  1. This complicates the JS polyfill, but does not make it impossible.

That's a minus, but the magnitude is unclear. It could be that other things complicate the polyfill more.

Seems like a mix, so far?

I think we're comfortably on the positive side here, which is where the consensus is coming from.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/223#issuecomment-150726350.

kripken commented 8 years ago

I'm mostly fine with this, except that "we don't know" != "neutral". Optimally we could take the time to investigate this carefully.

In any case, I do think that this - and possibly other changes, as you said, but this seems the biggest to me - makes the polyfill kind of pointless, in the sense that it's going to be less efficient to (a) make 1 wasm build and convert that to JS on the client when necessary, vs. (b) make 1 wasm build and 1 JS build. I suspect people will end up doing (b). Does that worry anyone?

MikeHolman commented 8 years ago

It does worry me. I don't really see any benefit other than "it feels good", and the potential impact on the polyfill is large. The polyfill is not a hugely important scenario for me, but I think it is useful for the health of the project. And it seems like we would be obliterating it in order to add a superfluous feature.

rossberg commented 8 years ago

I assume we actually do want to keep return as a sugar-level operator (that doesn't carry an explicit depth). So that specific concern should not be an issue.

I don't agree that this makes the polyfill pointless. I also don't think that it makes it less efficient. It does make it slightly more interesting, but not in infeasible ways.

rossberg commented 8 years ago

@kripken:

  1. There were arguments above that this would help port languages in which statements == expressions, but I'm not sure I follow that, since e.g. LLVM IR isn't like that, but pretty much everything can compile to it these days. On the other hand, having statements == expressions might help making compilers for those languages somewhat simpler, but that is a far weaker benefit.

Don't assume that all compilers will want to go through a complex pipeline like LLVM. In fact, I'm pretty sure the vast majority won't. Somewhere else I was pointing to the long list of compilers to JS -- if only a fraction of those ever get ported to Wasm (and I dearly hope so!) then we already have plenty of compilers that will benefit from expressions, some significantly.

titzer commented 8 years ago

We will try to answer the polyfill question definitively in the next two months, as @kg will be working on it.

kripken commented 8 years ago

@rossberg-chromium: yes, I completely agree, making it easier for all those languages is a very good thing. I worry about compromises towards that goal, but probably too much ;)

Regarding polyfill efficiency: If the polyfill needs to generate out of line functions as @rossberg-chromium suggested, or perform a more complex analysis to place auxiliary variables, or use three address code based on markers as @lukewagner suggested, then all of those can work, but they will necessarily reduce efficiency of either polyfill time or throughput of the generated code. Possibly substantially, in large codebases. For example, if there is a significant increase in the number of variables in a large function (e.g. three address code temporaries), then experience has shown that that can make a massive impact on performance. Emscripten works extremely hard to minimize the total number of variables, spending much of its compile time on it, because we know it is worth it. In fact, in large-enough projects, often unoptimized or partially optimized builds (having the # of variables not fully reduced) are completely unusable. The complexity of such variable reduction is similar to that of register allocation; the problems are closely related.

In small codebases, this won't matter. But in large ones, I strongly suspect it will.

I'm not saying this is in itself a big enough issue that we should consider avoiding statements == expressions. I just want to point out that we may be moving to a model where we can't polyfill well enough on the client, compared to people making two builds. Two builds will just be better: no polyfill time, fully optimized code.

I hope this doesn't sound too pessimistic. I'm glad to hear of plans to work on the polyfill, and would like to help out.

kg commented 8 years ago

Haven't had time to catch up on this thread, but I wanted to point out that the entropy concerns re: break vs return are pretty minor. If it turns out to be an issue we can address it, but I really don't think it will (especially since we can now omit many returns entirely)

qwertie commented 8 years ago

If polyfilling is slow in certain situations (like when there is a loop inside an "if" condition), perhaps the code could be restructured in advance with a Wasm=>Wasm transform, rather than asking the polyfill do it. Presumably this will increase the code size slightly and decrease native Wasm performance slightly. Developers can stop using the restructuring option once most browsers have native Wasm.

kripken commented 8 years ago

It occurred to me that there are alternatives to "single build of wasm + polyfill to JS" and "2 builds, wasm and JS". For example, a single build that is not wasm, but something else that is easy to convert on the client to either wasm or asm.js. For example, the single build could be @lukewagner's asm.js polyfill prototype, or it could just be asm.js itself. This is practical because while we are making wasm => JS more complex, asm.js => wasm remains trivial.

Clearly this is not optimal, and has a bunch of obvious downsides. But it seems worth exploring, even if just as a fallback plan. And actually such a converter could have other benefits, like offline upgrading of legacy asm.js builds to wasm. I'll investigate.

@qwertie: Interesting idea!

lukewagner commented 8 years ago

The reasoning that convinces me in favor of stmts=exprs is that comma and conditional (which we know we want and are useful for building fatter exprs) are just block and conditional and so it seems pretty arbitrary to have expr versions of just these two but not the rest with the only reason being "b/c JS". If stmts=exprs made the polyfill untenable, then it would be a different story, but that seems far from the case.

sunfishcode commented 8 years ago

This has been implemented for some time now.