Open xphung opened 4 years ago
Agreed this sounds like a good idea, it matches nicely with the way WebAssembly already uses the stack and locals. When you say it's different to let
, what do you mean? It sounds similar to the way let
is described in this document: https://github.com/WebAssembly/function-references/blob/master/proposals/function-references/Overview.md.
whereas "frame"/"end" will tear down the stack values supplied as input into "frame", without any additional instructions)
This is unlike other WebAssembly block functions, which don't tear down their stack implicitly. I think it would be more natural for this to require a br 0
to clean up the stack. It does mean that the instruction sequence doesn't quite match an inlined function.
inside the "frame" block, can't access locals which were defined outside the block, unless these are passed in as parameters
The let
described above appends locals to current list of locals, so it would still be possible to access locals from nested lets or the original function definition. Is there a benefit to disallowing access to outer locals?
inside the "frame" block, can't access locals which were defined outside the block, unless these are passed in as parameters
I think that would be very unpractical. We might have to define it such that each nested frame
adds N locals to the existing set of locals. But now you're talking about a much more heavyweight feature for engines to support it.
I generally do not see the advantage of this over pick
and friends. To me, the whole reason you might want pick
over getlocal
in the first place is that getlocal
is often very redundant: we have values on the stack, we want to consume values on the stack, but in between sit a bunch of getlocal
s whose sole purpose it is to recreate that stack.
Compare this hypothetical example:
i32.const 1
i32.const 2
call foo
func foo (i32 i32)
# first the args were on the stack, then we move them into locals, and now we're putting them back on the stack!
local.get 1
local.get 0
i32.add
# here we need to remove locals (and maybe args)
end
# vs a hypothetical foo that does not put args in locals, but just leaves them on the stack.
func foo (i32 i32)
# just as an example, assume args were in the wrong order, so we pick, or swap
pick 1
i32.add
# no cleanup required.
# end
end
But now you're talking about a much more heavyweight feature for engines to support it.
Is it? In many ways it fits nicely with what engines already have to do with block parameters in the multi-value proposal. The difference is that now those values can also be accessed by local.get
, etc. But it's always statically known whether those accesses are to function params, locals, or stack values.
I generally do not see the advantage of this over pick and friends.
See some of the discussion here, in particular @rossberg's comment:
I think both complement each other: pick works well for a one-off access into the stack. But for accessing the same slot multiple times, or multiple related accesses, or operand reordering, stack twiddling with pick and swap is more difficult to write, more difficult to read and debug, and I suspect often less compact.
I generally do not see the advantage of this over
pick
and friends. To me, the whole reason you might wantpick
overgetlocal
in the first place is thatgetlocal
is often very redundant: we have values on the stack, we want to consume values on the stack, but in between sit a bunch ofgetlocal
s whose sole purpose it is to recreate that stack.;; vs a hypothetical foo that does not put args in locals, but just leaves them on the stack. func foo (i32 i32) ;; just as an example, assume args were in the wrong order, so we pick, or swap pick 1 i32.add ;; no cleanup required. end end
If wasm had "pick" from the start in MVP v1.0 (and left fn parameters on stack rather than mapping them into locals), then I agree my "frame" proposal wouldn't be as useful as "pick".
I don't think getlocal/setlocal will be deprecated any time soon though...
For better or worse, wasm today is a "hybrid" stack machine that is [stack]+[local vars]+[params in locals]. It manages this complexity by adhering strictly to stack as top-only access, ie: the wasm stack is only push/pop and doesn't try to do random access. IMO, "pick" will blur the conceptual distinction between stack vs local vars, ie: both will have random access.
Also, what the "pick" example above doesn't show is after blocks/funcs returning multi-values, extra "drop" instructions may be needed to remove left over stack values. (There is likely to be at least one left over value, as otherwise pick wouldn't be needed, to duplicate values into correct stack order).
In addition to "pick", a "swap" will also be needed to restructure the stack sequence to comply with the end of block stack signature.... once the extra swaps/drops are all added, is there really any code density benefit with providing "pick"? IMO, we need to be very clear about this before we throw out the conceptual "purity" of wasm's stack (of not providing random access to the stack).
Looking at @rossberg 's proposal for "let", it is similar as what I propose for frame, but goes an additional step of remapping outer locals into local #[orig local idx + N], where N is the locals declared by an inner let.
For CPU's with plentiful registers, the wasm "stack" is really a CPU register "window" scheme, ie: a way of directing which CPU register the next wasm arithmetic instruction will operate on. (Local vars deeper inside the stack are then candidates for a "simplistic" JIT to spill into RAM).
What both "frame" and "let" are doing (which "pick" doesn't do) is indicating upfront to a simplistic JIT compiler to pin down & not to spill certain stack slots for the lifetime of the block. However, "let" doesn't convey as much information on what local vars are safe to spill to RAM (as any outer local var can still be accessed inside), whereas "frame" is very explicit in telling the JIT upfront which local vars to keep inside CPU registers. (Effectively the outer local var "push" needed by "frame" is like the "register" keyword in C, it's an advisory hint to the JIT to keep the var in a CPU register... if these hints are not useful to real JIT's, as opposed to a simplistic JIT, then Rossberg's proposal is superior to my proposal - as it has better encoding density).
Also, my own envisaged use case for "frame" is that it will process/re-sequence a set of multi-values, which then will be handed over to the outer block to complete processing on... I am guessing in most cases, the processing done inside "frame" is likely to be small and simple and will not need to access many outer vars, but it would be interesting to see data on this.
Hi, I can't resist thinking more about the "pick" vs "let" duality, and whether there is a case to have both pick & let.
I remain a proponent of providing "let" and not providing "pick" (I've already more or less conceded above that @rossberg 's "let" is superior to my "frame").
However, here is a suggestion: provide "let" and "dup", ie: pick #0 - but not a generalised pick. dup is a top-of-stack only operation, so it doesn't change the existing wasm model - that is, the wasm stack is limited to non-random top of stack ops, and local vars are to be the mechanism for random access.
There is possibly even a case for providing a "swap #1" - ie: limited to top of stack, so only swapping top two slots of the stack.
Limiting dup and swap to only the top of stack means we don't need to encode an immediate argument for these instructions, so they will only be 1 byte rather than 2+ bytes.
This is a win for code density - as well as the keeping the stack conceptually pure, by not providing random access to deeper slots in the stack.
I suspect the 80/20 law applies to this case - ie: 80% of the use cases of a hypothetical generalised pick/swap are handled by a more simple dup/swap, without immediate arguments. The other more complex 20% are then better addressed by "let" (making generalised pick/swap unnecessary).
I think this is an interesting idea, but it would be much more compelling with data. let
was useful for the function-references proposal because it provides a real value -- being able to define a local variable without a default. But it seems dup
and swap
would be purely for optimizing code size, so it would be useful to know how much savings we could expect to achieve. That said, I'm not sure the easiest way to gather this kind of information... perhaps modifying binaryen to understand these new instructions so it can "re-stackify" a function.
I think this is an interesting idea, but it would be much more compelling with data.
let
was useful for the function-references proposal because it provides a real value -- being able to define a local variable without a default. But it seemsdup
andswap
would be purely for optimizing code size, so it would be useful to know how much savings we could expect to achieve. That said, I'm not sure the easiest way to gather this kind of information...
Java VM and CIL data might be useful here.
Java VM appears to have a number of variants of dup. These insert the top value of the stack into slots 2-3 levels down. It also has swap, but no pick (anyone with more detail JVM knowledge correct me if I am wrong). Hence, profiling instruction counts of the "dup" variants could give a frequency histogram of stack manipulation by slot dept (0 vs 1 vs 2 vs 3) but this would be a "reverse" of the pick data we want (so not perfect)
CIL has dup, but does not seem to have swap or pick (again, anyone with detailed CIL knowledge correct me if wrong). So CIL won't be helpful in giving data on need for pick beyond swap, although presumably the incidence of multiple pops followed by dup might be very approximate hint on this.
If there is interest in the JVM instruction count data, then I could do some profiling of some java classes to get the dup_XX instruction counts.
Sounds potentially useful, but I'm note quite sure how to translate that info into wasm binary savings. I guess if we knew how common dup
is in JVM, and then the relative code density of the two formats, then we could make a guess. In wasm currently, a dup
could be implemented by adding a local, then using local.tee
followed by local.get
:
(local $new i32)
...
(local.tee $new)
(local.get $new)
The new local is practically free (just requires incrementing the local count for that value type, so requires an extra byte only at 27n boundary). The local.tee
and local.get
instructions are 2 to 6 bytes each, but 2 is much more likely. So we can expect dup
to save at least 3 bytes every time it's required. Maybe that will give us a good ballpark.
Instead of providing "pick" or "let" instructions, I propose providing a "frame" block instruction (effectively providing frame-pointer like access to the wasm stack, similar to what happens for the "call" instruction), eg:
instead of: i32.const 100 ;; push some values onto stack i32.const 200 i32.const 300 pick 2 ;; result is 100
you would have: i32.const 100 ;; push some values onto stack like an inline function call i32.const 200 i32.const 300 frame param(i32 i32 i32) result(i32) ;; above 3 stack values are now locals 0,1,2 (equivalent to a call into an inlined function) getlocal 0 ;; result is 100 end ;; stack frame is torn down, and all associated param values (100,200,300) are dropped from stack ;; (equivalent to return from an inline function)
Advantages over "pick" and "let":
Disadvantages: