'retval' proposal re: calls and multiple return values

WebAssembly / design

WebAssembly Design Documents

http://webassembly.org

Apache License 2.0

11.41k stars 695 forks source link

'retval' proposal re: calls and multiple return values #280

Closed kg closed 9 years ago

kg commented 9 years ago

In #278, Dan suggested some good refinements for our thinking about calls and return values. Implied in this is the potential for multiple return values to show up in the future. I think it's relatively inevitable that multiple return values will creep their way in eventually, even if C/C++ never use them. In the event that they do, this has some nasty implications for our current model:

We need a way to represent the idea of an expression yielding multiple distinct values of different types. We need a way to handle those value tuples: Assign them out into locals, etc. A function needs a way to construct those value tuples so it can return them.

My suggestion for tackling this and some other issues is that we should define a distinct concept separate from local variables that I'll refer to as 'retval'. Effectively, retval is a tiny set of registers that contain the return value(s) of the most recent function call, with various restrictions to make it simple to reason about.

A rough sketch of how retval would work:

Every function call produces 0 - N return values. In the MVP, a function call is either void (no retvals) or has a single retval of type T.
After a function call, retval N is valid so long as N < the number of return values from the previous function. The type of this expression depends on the signature of the most recent function call, and it evaluates to the contents of that numbered return value register.
retval N can only be used once per return value per invocation; it consumes the result. To use a return value multiple times, store it into a local.
Crossing a branch invalidates the return values from the most recent function call.
- This is probably a pain from a compiler perspective but seems valuable in terms of making the runtime side simpler. No need for SSA or phis or whatever.
Call expressions with a single return value evaluate to the equivalent of (call @a ...), (retval 0) - essentially this is a simple transform that can happen in the compiler, decoder, or native runtime. I'd probably put it in the decoder.

So, (complete strawman), existing code that looks like this:

local @a
setlocal @a (call @sin (const 0.75f))
call @print (getlocal @a)

would look something like this:

call @sin (const 0.75f)
call @print (retval 0)

(EDIT) or this:

call @print (call @sin (const 0.75f))

So, why? Various reasons:

Without this design change, function calls end up producing a temporary local for every return value. This bloats the number of locals in large functions tremendously, unless compilers pool temporary locals, in which case...
- Now static analysis is more complicated because you have to SSA or GVN or some other horrifying three-letter-acronym to figure out the lifetime of return values, when stack/register space is needed, etc.
Transforming between nested and non-nested calls is simpler.
- In the old model, hoisting a nested function call out into a statement would require you to introduce temporary locals for its return values. In this model, you just undo the return value transform described above (call @a (call @b ...) -> call @b ...; call @a (retval 0)).
The lifetime of return values is trivial to reason about.
- It becomes trivial to identify unused return values. Just look for a retval.
- It becomes easier to do optimizations in the vein of 'move the call to the location where its result is used, if it's safe', etc
- It's way easier (IMO) to write a naive runtime with reasonable performance this way - you don't need to aggressively nuke temporaries quite so badly.
- It's easier to figure out whether to shove return values into a register.
Forwarding multiple return values into a call immediately afterwards becomes easier.
- call @a (call @b) is pretty easy to expand in the case that @b returns say, 3 values: call @b; call @a (retval 0) (retval 1) (retval 2).
Deduplication (nullary macros) becomes way more effective in this model.
- Common sub-expressions that operate on the result(s) of a function will trivially deduplicate, where before the expressions were operating on arbitrary numbered locals (or on function call expressions directly), which needs parameterized macros to deduplicate. Yuck.
- A compiler can generate whatever it wants, and a smart 'wasm optimizer' can attempt to convert calls between the two representations (inline vs retval) to see whether one of the two produces better deduplication results (and thus, smaller file size).

I think this is especially important to think about, because many future languages that we (probably) want to target wasm have multiple return values or something equivalent. Go extensively uses multiple return values, and C# somewhat-extensively uses something roughly equivalent (out parameters).

As far as returning multiple values from a function at once goes - I think we punt on that entirely until after the MVP, but this model probably makes sense there too (opcodes for storing values into your result registers, or something similar).

If people like this proposal I can try to draft up a PR for it later.

jfbastien commented 9 years ago

C++ does allow you to return multiple values with structs. You can even std::tie and use std::tuple!

I think the main issue you're discussing is: with just one return value calls only generate temporaries. With multiple, these temporaries become awkward.

(set_local @a (call foo @bar)) ; Make it a local.
(call bar (call foo @bar)) ; Notice the temporary!

Versus:

(set_locals (@a @b) (call baz @bar)) ; Maker two locals.

What if I want to use the two temporaries? Tough luck!

I'm not sure I like the proposed solution, but I can't think of any alternative for now...

titzer commented 9 years ago

For an outside datapoint, Virgil supports arbitrary tuples [ https://code.google.com/p/virgil/wiki/TutorialTuples] which are flattened in the middle of the compiler. That is the broadest generalization of value types that I can think of. It requires the compiler to support multiple-output value nodes, which TurboFan does to some extent.

Restricting multiple-value nodes in some sense would be good here, I think. I can imagine a baseline JIT or interpreter being very, very frustrated with generalized tuples.

That said, the implicit dataflow from the last call to the next retval usage in this proposal could cause some problems. If we can achieve a more explicit dataflow here, it will make things more composable.

On Thu, Jul 23, 2015 at 8:26 PM, JF Bastien notifications@github.com wrote:

C++ does allow you to return multiple values with structs. You can even std::tie and use std::tuple!

I think the main issue you're discussing is: with just one return value calls only generate temporaries. With multiple, these temporaries become awkward.

(set_local @a (call foo @bar)) ; Make it a local. (call bar (call foo @bar)) ; Notice the temporary!

Versus:

(set_locals (@a @b) (call baz @bar)) ; Maker two locals.

What if I want to use the two temporaries? Tough luck!

I'm not sure I like the proposed solution, but I can't think of any alternative for now...

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/280#issuecomment-124197534.

kg commented 9 years ago

One mitigation for some of your concerns about the dataflow would be for retval to literally just be syntax sugar that can boil down to temporaries managed by the runtime/decoder instead of the compiler. That lets VMs decide how best to handle it. A naive JS polyfill would just generate locals for every function call, and then map the retvals to the appropriate locals, for example. We still get the size/clarity wins in the IR.

But if it just makes things harder on the runtime author, that's no good. Maybe there's a way to retain most of the size wins without doing that.

bar commented 9 years ago

@jfbastien what? why me?

jfbastien commented 9 years ago

I think what I want to avoid is creating first-class tuple types, and tuple types that can exist as locals. I feel like the retval idea tries to work around that restriction, but isn't really better than (tie local0 local1 (call multi_return_func args)).

I'm leaning towards liking tie to locals better...

kg commented 9 years ago

Well, retval has a few advantages that led me to propose it (smaller ast, more-compressible values, ability to discard return values) but (tie x y (call ...)) is definitely a minimal-pain solution. We can do that post-MVP when we actually do multiple return values, and just punt entirely on the issue for now.

titzer commented 9 years ago

On Thu, Jul 23, 2015 at 10:44 PM, JF Bastien notifications@github.com wrote:

I think what I want to avoid is creating first-class tuple types, and tuple types that can exist as locals. I feel like the retval idea tries to work around that restriction, but isn't really better than (tie local0 local1 (call multi_return_func args)).

Maybe we could have a multi-call bytecode that has the tie built into it?

I'm leaning towards liking tie to locals better...

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/280#issuecomment-124234928.

qwertie commented 9 years ago

@jfbastien Why do you want to avoid creating first-class tuple types?

jfbastien commented 9 years ago

First-class tuples are the same as structs. The backend needs to start being smarter, and we need to define a bunch more than what we have now.

lukewagner commented 9 years ago

I'm concerned that retval seems to require some amount of control/flow analysis to validate which will complicate decoding (type-based validation has otherwise eliminated all flow-dependent validation criteria). To avoid the analysis, we could say that retval must immediately follow a call, but then we might as well call it a single opcode. I'm also concerned with composition: if I have a nested call f(g()) I can use the retval of g() as the arg of f(), but if I have f(g(), h()) then g()'s retval needs to be "spilled" to a local before it is clobbered by h() and now you have to think of retval as an optimization that you try to do before falling back to set_local and in general this makes me think of bad times with x86/ARM condition flags.

I do like the idea of combining the destructuring of a returned tuple with a call opcode as @titzer suggested. Two benefits:

aesthetically, it seems symmetric with arguments (where the creation of the args tuple also happens at the call)
practically, by forcing the returned tuple to be immediately consumed, we avoid having to implement annoying corner cases likes phis of tuples.

One issue is that call isn't the only opcode where we want multiple return values (think add_with_overflow or div_mod), so we'd have to explicitly do this for every multi-return opcode. While this sounds like duplication, I expect we could define it orthogonally to allow reuse in the spec and impl (kinda like how modr/m is (sorta, kinda) defined orthogonal to opcode in x86).

One idea that I didn't see brought up that I've been assuming for multi-return: Instead of requiring every element of the returned tuple to be assigned to locals, we could allow one element to be the result of the expression. For example, when compiling this clang C++:

return __builtin_uadd_overflow(x, y, &sum) ? UINT32_MAX : sum;

then the AST could be:

return (conditional (i32.add_with_overflow @x  @y  @sum) UINT32_MAX @sum)

which is pretty nice, I think. To be clear what's happening: as a binary-return op, i32.add_with_overflow takes 1 extra immediate operand (the index of the local to store the sum).

Addressing some specific points of the OP:

Without this design change, function calls end up producing a temporary local for every return value. This bloats the number of locals in large functions tremendously, unless compilers pool temporary locals, in which case. Now static analysis is more complicated because you have to SSA or GVN or some other horrifying three-letter-acronym to figure out the lifetime of return values, when stack/register space is needed, etc.

It is definitely the plan for LLVM to attempt to maximally reuse local slots. This is good for a number of reasons:

forgetting this multi-return issue, if allocations aren't pooled, the SSA form of LLVM would generate huge numbers of locals (one per def/phi)
a natural impl on a dumb compiler is going to give one stack slot per local, which means huge stack frames
for a dumb compiler on a register-rich arch, locals can be mapped to registers allowing the dumb compiler to reuse the regalloc done by the beefy AOT compiler
for a beefy compiler, various algorithms can have cost proportional to the number of locals (such as the structured, single-pass SSA generation in IonMonkey) and reducing the number of locals is already an optimization Emscripten does to reduce this compilation cost
no fancy analyses (SSA, GVN) are needed to deal with reuse of locals; b/c of (1) typed locals, (2) structured control flow, the simplest single-pass algorithm can easily track the current def associated with a local and generate SSA as it goes.

Thus, I think we are already assuming that locals were going to be aggressively reused and this was actually seen as a strength of our psuedo-register approach compared to mandating SSA form.

Transforming between nested and non-nested calls is simpler.

I think most of this analysis would be done in LLVM IR, before locals/wasm enter the picture. In general, any wasm compiler will invariable need the ability to decompose an expression using temporaries. A natural impl would be to maintain a logical stack of temporaries to maximize local reuse.

The lifetime of return values is trivial to reason about.

I still think it involves some small amount of analysis which means that a dumb compiler won't benefit from it (assuming the dumb compiler is doing single pass generate-code-as-you-go). OTOH, a beefy compiler won't have any problems with locals.

Forwarding multiple return values into a call immediately afterwards becomes easier.

It's important to qualify "easier". I think there is limited value to trying to make wasm more ergonomic to write by hand (that's definitely not a tier-1 priority compared to simplicity, speed of decoding and size). OTOH, "easier" could mean "requires less bytes", but I think there are multiple ways to achieve that goal in terms of tuples (think splat) if we measure this as a significant opportunity.

Deduplication (nullary macros) becomes way more effective in this model.

Maybe, but needs measurement and again I expect we could introduce operations on tuples to a similar effect if we saw there were significant savings from multi-return.

kg commented 9 years ago

Good points on all of this, Luke. Re: deduplication the advantage isn't from tuple operations, though, it's that it normalizes more expressions so that they all contain retval 0 and retval 1 instead of arbitrary locals. Once you get rid of that, every expression is using some arbitrary local that the call result(s) got spilled into, and the only way to deduplicate those expressions is with parameterized macros.

It's not clear to me how my proposal requires control flow analysis. The requirement that retval not persist across branches was designed to eliminate that. What did I overlook?

The point about nesting two function calls is a good one. There are a number of scenarios where you'd still need temporaries, and maybe that alone makes it problematic enough to not be worthwhile.

lukewagner commented 9 years ago

On the subject of deduplication: a compiler could take deduplication into account when assigning local variable indices. In the limit, it could effectively simulate retval by reserving certain locals for return values. Certainly more experimentation to do in this space.

You're right that the "control flow analysis" is super-simple given that it's intra-basic-block. I guess my spider senses were going off because it is a categorically different kind of validation criteria than the rest of our type-based validation which is only sensitive to the signature of ops and the types of child expressions.

kg commented 9 years ago

How would y'all feel about a variant of retval that is strongly typed? So instead of (retval 0), (retval 1), hypothetically:

(retval:int32 0), (retval:float32 0). So the type of a retval expression is always known and there's basically a block of virtual registers for function return values segmented by type. 'What happens with multiple calls used as arguments' is still a tricky problem but I think in that case you just fall back to storing into a local and it's no worse than things were before.

After some thought I think in practice retval is an alternate representation of temporary locals and function return values. The alternate representation is more regular so it compresses better, and it has additional constraints that can potentially make it easier for an engine to deal with (single-use, doesn't cross over branches, etc.) A naive implementation can just lazily convert function calls and retval expressions to always use locals under the hood (especially if the opcode is typed).

jfbastien commented 9 years ago

IIUC typed retval is indeed the same as storing to stored locals, but indeed has some regularity by being automatic. Sounds potentially worthwhile, though I wonder if it trades regularity for ease of optimization? Probably not.

lukewagner commented 9 years ago

If we set aside the size/compression argument, then I think it's simpler and more regular not to introduce a new special kind of local with special semantics.

If the only argument is size/compression, it seems like we need some motivating data and arguments why layer 1 (viz. macros) can't achieve the same effect. In particular, I think we could simulate retval at layer 1 by (1) reserving a set of locals to serve as retvals, (2) defining macros for any hot multi-return ops that name those locals. Honestly, I'm not expecting multi-returns to be a significant % of total bytes in any large app.

titzer commented 9 years ago

It's clear that multiple return values need some addition to the AST semantics we have now, and the solutions discussed lie roughly on a continuum of generality. I'd sketch the space a bit like this:

(GeneralTuples) On one end is general tuples that are flattened by the implementation to be scalars. That is the approach I took in Virgil, generalizing all functions to be single parameter, single return and mapping multi-argument functions onto functions that take and return tuples. That gives a really nice flexible source semantics, but the implementation burden is to flatten tuples and deal with them in the compiler IR. There's lots of fun with tuples inside of tuples, arrays of tuples, tuples of arrays, etc. The result is that the generated code is really efficient, as native calling conventions are used so that arguments and return values go in registers and on the stack, regardless of how complex the tuples at the source level are.

(ScalarsOnly) On the extreme other end is no support at all for tuples, only scalars like we have now. A function has a fixed number of scalar parameters and a fixed number of scalar return values. Local variables and expressions can only have scalar types.

(OnlyLocalTuples) In the middle of the continuum, we could allow tuples within a function body so that local variables and expressions can have tuple types. A call to a function that returns multiple values has a tuple type. That tuple could be stored in a local variable. Operations to access elements of tuples are required. If only flat (non-nested) tuples are allowed, then tuple access always produces a scalar.

(RetVal) is another solution in the middle. I see at as basically a short-distance tuple that expires after some intervening operation. Local variables still can only have scalar types, but the special "retval" expressions have tuple types. There are rules about when a "retval" can be used (e.g. some places are illegal), and the return values that a "retval" accesses are determined not by nesting or numbering, but implicitly through control flow.

I think we want to avoid (GeneralTuples). As I discussed above, it's a nice source-level semantics, but is kind of hell for implementations. I think we want to keep WebAssembly simple and machine-like, leaving overly-general source niceties up to a higher-level compiler. For point of reference, if I were to target Virgil to WebAssembly, I'd still flatten tuples in my compiler before generating wasm, so I don't think leaving generalized tuples out of wasm would preclude any nice source languages.

So without (GeneralTuples), we're left with (ScalarsOnly), (OnlyLocalTuples), and (RetVal).

I'd argue that (OnlyLocalTuples) has most of the same problems as (GeneralTuples), even if we restrict the local tuples to be flat. Implementations still have to track the types of local variables and expressions when they are tuples, check tuple accesses, and map a local tuple to multiple machine registers or stack slots. It probably makes a baseline JIT even more tricky.

That leaves (ScalarsOnly) and (RetVal), which is basically where this discussion is now.

Luke and I seem to be leaning towards (ScalarsOnly) with the assumption that we need a special "call_multiple" expression that assigns to multiple locals. This design at least preserves the nice properties of only having scalar types for local variables and expressions with the potential cost of forcing wasm producers to create lots of temporaries and encode big-fat call_multiple bytecodes.

StrawMan1:

call_multiple %0 %1...%K [func] [arg0] [arg1]...[argN]

The semantics would be to call the function [func] either directly or indirectly, with the results of evaluating arg0...argN, and then assign the results to temporaries %0 to %K. The overall result of the call_multiple expression would be the value %0.

StrawMan2:

call_multiple %1...%K [func] [arg0] [arg1]...[argN]

The semantics would be to call the function [func] either directly or indirectly, with the results of evaluating arg0...argN, and then assign the results to temporaries %1 to %K, with the difference being that the first return value does not require a temporary. The overall result of the call_multiple expression would be the first return value of the call (i.e. what would have been %0 in StrawMan1). StrawMan2 is potentially more compact than StrawMan1, since the first return value is implicitly returned, and we thus need 1 less temporary. For a likely case of a 2-tuple return value, this only requires a single extra local.

On Wed, Jul 29, 2015 at 9:46 AM, Luke Wagner notifications@github.com wrote:

If we set aside the size/compression argument, then I think it's simpler and more regular not to introduce a new special kind of local with special semantics.

If the only argument is size/compression, it seems like we need some motivating data and arguments why layer 1 (viz. macros) can't achieve the same effect. In particular, I think we could simulate retval at layer 1 by (1) reserving a set of locals to serve as retvals, (2) defining macros for any hot multi-return ops that name those locals. Honestly, I'm not expecting multi-returns to be a significant % of total bytes in any large app.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/280#issuecomment-125871913.

sunfishcode commented 9 years ago

I like StrawMan2. While I see the appeal of StrawMan1 in allowing all return values to be handled consistently as named variables, this can be achieved with StrawMan2 by adding a set_local:

set_local %0, (call_multiple %1...%K [func] [arg0] [arg1]...[argN])

which is slightly less pretty but achieves the same effect. And, it avoids the awkwardness in StrawMan1 of producing a value into two output locations.

lukewagner commented 9 years ago

Agreed

rossberg commented 9 years ago

FWIW, the ML prototype currently implements a simpler variation of (OnlyLocalTuples) that I would call (SecondClassTuples), and which doesn't have most of the problems @titzer mentions.

In that approach, values or variables cannot have tuple type, only expressions can. In general, any expression produces a sequence of values, most of them unary or empty. Calls can produce n-ary expression type. Such types can be consumed by either a Destruct operator (an n-ary SetLocal), by Return, or if you want, by other Calls (as their argument).

It is not possible to get a first-class handle on the tuple itself. Nor can tuples be nested, because their components are values.

That's more modular in several ways: (1) it allows multiple ways of consuming tuples; (2) it also allows introducing additional expressions that produce n-ary results; (3) it allows instructions that are generic in arity. In particular, it doesn't require introducing a _multiple variant of each Call instruction (with the strawmen above, you'd seem to need to add at least both call_multiple and call_indirect_multiple?).

titzer commented 9 years ago

On Fri, Aug 14, 2015 at 9:37 AM, rossberg-chromium <notifications@github.com

wrote:

FWIW, the ML prototype currently implements a simpler variation of (OnlyLocalTuples) that I would call (SecondClassTuples), and which doesn't have most of the problems @titzer https://github.com/titzer mentions.

In that approach, values or variables cannot have tuple type, only expressions can. In general, any expression produces a sequence of values, most of them unary or empty. Calls can produce n-ary expression type. Such types can be consumed by either a Destruct operator (an n-ary SetLocal), by Return, or if you want, by other Calls (as their argument).

It is not possible to get a first-class handle on the tuple itself. Nor can tuples be nested, because their components are values.

That's more modular in several ways: (1) it allows multiple ways of consuming tuples; (2) it also allows introducing additional expressions that produce n-ary results; (3) it allows instructions that are generic in arity. In particular, it doesn't require introducing a _multiple variant of each Call instruction (with the strawmen above, you'd seem to need to add at least both call_multiple and call_indirect_multiple?).

If expressions can produce tuples, it means that decoders have to track the number of outputs from expressions. Concretely, the decoder stack (and potential SSA renaming environment) tracked during descent is no longer unary. That's a pain point.

I don't think we gain (or need) much expressive power by allowing tuples to feed directly into other calls. StrawMan1/2 basically build Destruct directly into CallMultiple.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/280#issuecomment-131006087.

rossberg commented 9 years ago

A couple of things you could gain in the future are e.g. primitive operators with multiple results (e.g. divmod), or potential forms of control transfer with multiple values (e.g. exceptions or a generalised Break). None of these are relevant for the MVP, but it would still be nice to have a factorisation that is somewhat future-proof (and avoids duplication of opcodes).

As for decoding complications: true. Not sure that is too terrible, though. Moreover, it could be avoided (for most part) by disallowing multiple values to flow through other expressions (such as conditionals), in which case only the top of the stack could be n-ary. That leaves the option to generalise later.

lukewagner commented 9 years ago

Definitely interested in multiple-return primitive values, but it seems like these ops could just do the same thing as call_multiple which is to take the outparams local index as an operand (this was my i32.add_with_overflow example above).

The more interesting case as you point out is if we had generalized tuple return from control flow primitives. @sunfishcode and I were talking about this a long time ago in the context of reducing the high % of bytes spent on get_local/set_local by trying to give more opportunity to build bigger trees. It'd be nice to do some experiments once llvm-wasm is emitting full programs to see how much this would help in practice.