I'm not sure I understand the data you're quoting when you say "the block top level expression operators that are discarded":

    set_local 236239 50.94

Is that 236239 total uses of `set_local`, at 50.94% of total instructions? What percentage of `set_local` are discarded?

I'm also not sure the binary was well optimized. Are there missing opportunities? We can't decide to remove the feature if we think we haven't made a proper effort to optimize it.
@jfbastien Walked the wast, and at a `block` walked the top level expressions and noted the immediate operation (the results of which are discarded), and counted them. For example `(block (set_local ...) (i32.store ...) (set_local ...) (call ...) (i32.store ...))`. So there were 236239 `set_local` operations in the block top level expression position, the results of which were discarded. Some of the `call` and other operators in the top level position might have returned zero values too, and then the decoder could have accounted for them in a compressed count as well, but this was not checked. This does not seem to be a 'missing' optimization issue, just natural blocks of effective statements. This is just my input into the ongoing discussion, and I am interested in what other solutions people have, but I think this solution has some merit.
This issue also affects one-pass compilation from wasm to machine code, because every `set_local`, `store`, and also `call` and friends can produce a value whether it's wanted or not, so a single-pass consumer has to assume it's needed and allocate storage for it.
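To make the concern concrete, here is a minimal Python sketch (the opcode names and the `(op, *args)` stream format are mine for illustration, not the actual binary encoding): in a postorder stream the consumer reaches `set_local` before anything has revealed whether its result is consumed, so it has to reserve a slot regardless.

```python
# Hypothetical single-pass consumer of a postorder opcode stream.
# Opcode names and the (op, *args) tuple format are illustrative only.

def consume(postorder_ops):
    value_stack = []  # each entry costs a register or a stack spill slot
    for op, *args in postorder_ops:
        if op == "get_local":
            value_stack.append(("local", args[0]))
        elif op == "i32.add":
            rhs, lhs = value_stack.pop(), value_stack.pop()
            value_stack.append(("add", lhs, rhs))
        elif op == "set_local":
            val = value_stack.pop()
            # set_local also *produces* the stored value; at this point the
            # consumer cannot tell whether anything downstream uses it, so
            # it must keep a slot alive for it just in case.
            value_stack.append(("set_result", args[0], val))
    # Whatever remains at block end was dead all along, but storage was
    # already reserved for it during the single pass.
    return value_stack

# One statement, result unused: the slot is held until block end anyway.
leftover = consume([("get_local", 1), ("get_local", 2),
                    ("i32.add",), ("set_local", 0)])
assert leftover == [("set_result", 0, ("add", ("local", 1), ("local", 2)))]
```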
I agree that it's worth considering void-returning `set_local`, `store`, and also variants of `call` and friends that return no value even when the function being called has a return value. With these operators, it would be practical to avoid ever having any dead values on the decoding stack.
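For illustration, a sketch of how a hypothetical void form (the name `set_local_void` below is made up for this example) would interact with the same consumer: it pops its operand and pushes nothing, so a dead value never reaches the stack.

```python
# Same illustrative consumer as above, extended with a hypothetical
# void-returning form that is defined to produce no value.

def consume_with_void(postorder_ops):
    value_stack = []
    for op, *args in postorder_ops:
        if op == "get_local":
            value_stack.append(("local", args[0]))
        elif op == "i32.add":
            rhs, lhs = value_stack.pop(), value_stack.pop()
            value_stack.append(("add", lhs, rhs))
        elif op == "set_local":          # value-producing form, as before
            value_stack.append(("set_result", args[0], value_stack.pop()))
        elif op == "set_local_void":     # hypothetical void form
            value_stack.pop()            # consume the value, push nothing
    return value_stack

# The same statement encoded with the void form leaves nothing behind:
assert consume_with_void([("get_local", 1), ("get_local", 2),
                          ("i32.add",), ("set_local_void", 0)]) == []
```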
@sunfishcode: I don't understand the storage overhead you mention. Isn't it the case that e.g. a `store` return value can only be ignored when in a block, for example? In that case, we need to allocate room for it in the block, just like every other instruction - what is being wasted here? The only not immediately obvious case seems to be a `store` at the end of a block (we need to look lower to see if it's actually going to be consumed), but that's very rare I think.
@kripken With the post-order encoding a single-pass compiler does not know if the result is used until the end of the block, so it would need to keep around all results of every `store` and `set_local`. Could add `block` to this list or a `block0` variant, and also bring forward the `call_multiple` variant that writes results to local variables so that it does not return expression results. Also bring forward the `br0` and `br0_if` variants just for encoding efficiency. Could consider an `if0` too, but I guess it is unnecessary anyway and producers could use `block0` `br0_if` `br0` for diamonds. btw: the expression-less encoding I am exploring is now beating the current encoding in terms of encoded file size, and the changes in this issue move in that direction.
By "allocate storage", I mean in terms of code generation -- it has to allocate a register or some stack space to hold the value. Also, it appears that I neglected to mention that this concern is specific to postorder encoding. In postorder, a single-pass consumer looking at a particular operation can't tell if that operation's result will be used.
This isn't important for operators with no side effects, because reasonable producers shouldn't leave behind obviously dead code, but currently it's unavoidable in `set_local`, `store`, `call`, and friends.
Ok, thanks, now I see - both of you are concerned with something like a baseline compiler.
This seems like a valid concern; however, the perf loss in baseline vs the downsides of splitting e.g. `set_local` into `set_local` and `set_local_noreturn` would need to be compared.
I don't know what the downsides are of having the extra opcode (spec complexity sure, because as @JSStats points out there are probably several opcodes that need this split, but anything else?) but the baseline compiler is doing strictly more work both at compile time (register / stack management) and run time (register shuffling / memory traffic) without the information. That said, I think it's a fruitful question whether an efficient one-pass compiler needs to be supported or if a 1.5-pass compiler (that builds partial trees until it has enough information to discard them) is an acceptable trade-off for good code.
We just need data on this of course, but the main downside that concerns me is splitting the super-common set_local node. If right now it has opcode X with super-high frequency, then if we split we'd have X1, X2, with lower frequencies, which might be bad for compression.
edit: get_local wasn't relevant
Another alternative would be to add a `discard` operator. This way we could keep `set_local` and `get_local` as they are.
True. But then we'd have X, DX (if "D" is the discard prefix code), which I suspect is also less efficient than X.
I think the core issue is that this information is already present in the AST, so splitting the opcode or adding a discard prefix opcode is redundant information that we just add for a single-pass compiler.
I got some quick data now. I measured what I think is the worst case, where half of set_locals get the new opcode or a prefix opcode. The result on BananaBread is a 5.2% size increase after gzip with splitting, and 6.7% with adding a discard prefix code.
How did you pick which set_locals get the new opcode?
Just the simplest silly cycling between the two.
It's possible that silly cycling could be biased against compression though. Binaryen knows which set_locals have an unused result, so would it be possible to use that instead?
@JSStats I'm seeing a different result than you are. I just instrumented the sexpr-wasm interpreter to see how many `set_local` values are discarded in AngryBots, and found that 90.39% of them are. (I also see 580108 set_locals, not 236239 / 0.5094 == 463759.)

edit: oh, I think I misread the numbers. You were calculating it as a percentage of total discarded values? Still, I'm wondering why I'm seeing so many more discarded `set_local` values than you are.
@sunfishcode: That usage info is not trivially available in the AST, which is why I grabbed the quickest data first. But, I think it's also a good worst-case estimate, both because cycling keeps them mixed, and because a 50-50 distribution maximizes the entropy.
I also meanwhile found a way to use the usage information in wasm-as. With that, I see a 1.1% size increase on BB and AB.
So we are probably looking at something around a 1.1% gzip size increase, and very probably less than a 5.2% increase, in return for some amount of benefit in single-pass compilers. I guess the question is then, how much of a benefit in them, and how much do they matter?
@binji: Which AB wast is that on? Current binaryen can "sink" set_locals when allowable (this found some places where the wasm backend wasn't doing all it could). As a result, older binaryen has few uses of set_local values, but newer binaryen has more. (To get the optimized version, you can do something like `binaryen-shell AB.wast -O --print &> AB_new.wast`; I'd update the build-suite, but I assume some people want the same one for consistency.)
@kripken Yes, I was using the file from the build suite. OK, it's much better now, looks like 41.95% of `set_local` values are discarded (242251 / 577477).
In case it helps the discussion, here is a version of AngryBots wasm that has been flattened into single statements, with no `if` or `if_else` either. Some immediate constants are allowed; they map to opcodes accepting immediate constants in the encoding I am exploring, and this helps the encoded size.

The uncompressed size is 40% larger, and the brotli compressed size is 26% larger. But adding operator variations that accept immediate locals and some immediate constants addresses much of this bloat, bringing it to 8% smaller uncompressed and 13% larger compressed.
Further, a simple local variable predictor is relatively successful, with around a 50% hit ratio, and seems to further address the bloat. The predictor tracks the index of the last local variable written of each type during encoding. Arguments are assumed to read the last local written and to be single-use, so after reading each argument the predicted index decreases; they are assumed to be read in reverse order, as if being popped from a stack held in the local variables. Operator results are predicted to write to the next local, and the predicted index is reset with each write. If the locals are allocated in an expression-stack pattern then the hit ratio is very high for this predictor. It can be exploited easily by encoding a hit as zero and leaving it to the compressor to exploit, and there might be some merit in special opcodes that assume a hit (around 25% of operators in AngryBots).
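A hedged sketch of that predictor as I read the description above, for a single value type (a full version would keep one instance per type; all names here are illustrative):

```python
# Sketch of the local-variable predictor described above, for one value
# type; a full version would keep an instance per type (i32, i64, ...).

class LocalPredictor:
    def __init__(self):
        self.last_written = None   # index of the last local written

    def predict_read(self):
        # Arguments are assumed to be the most recently written locals,
        # single-use, read in reverse order (like pops from a stack).
        return self.last_written

    def observe_read(self, actual_index):
        hit = actual_index == self.last_written
        if self.last_written is not None:
            self.last_written -= 1   # single use: next read is one lower
        return hit

    def predict_write(self):
        # Results are predicted to land in the next local.
        return None if self.last_written is None else self.last_written + 1

    def observe_write(self, actual_index):
        hit = actual_index == self.predict_write()
        self.last_written = actual_index   # reset on every write
        return hit

# Example: locals allocated in an expression-stack pattern predict well.
p = LocalPredictor()
p.observe_write(4)            # result written to local 4
assert p.observe_write(5)     # next result lands in local 5: hit
assert p.observe_read(5)      # arguments read back in reverse order: hit
assert p.observe_read(4)      # ...stepping down the "stack": hit
```

On a hit the encoder could emit a zero, or use a dedicated assume-hit opcode, and leave the rest to the general-purpose compressor, as described above.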
Some of the bloat is due to flattening `if` and `if_else` to `block` `br_if` and `br`. It's not so bad keeping `if` and `if_else`: 34% larger uncompressed and 23% larger compressed. There is probably some alternative representation that could address this too. Using operators with immediate local variables and some immediate constants, plus prediction of local variables, and keeping `if` and `if_else` for control flow only, the experimental encoding is giving an 8% smaller uncompressed file size and a 6% smaller brotli compressed file size (unverified).
I'll try to clean up the encoder and write a decoder that losslessly encodes and decodes the above wasm file, to check and prove the validity of an encoding, but you can see from this wasm file that the prospects for an expression-less encoding are good. It might still remain to be shown that a decoder is fast, and that asm.js can be converted to this easily.
Just want to reiterate that this solution does not require removing the empty/void type. The results of `nop` or a `discard` can still be consumed by operators that accept zero values, so they still need to be represented on the expression stack until the end of a block. The solution proposed is to compress these and bias the top-level expressions towards producing mainly zero-values. Ignoring a type does not make it go away!
Discarding results is looking like a big issue with the post-order encoding. Looking into a register-based encoding (or macros) suggests this would be an extreme problem; for example, if expressions were flattened into macros that expanded into, say, `(set_local $l1 (i32.add (get_local $i2) (get_local $l3)))` etc., then there will be a lot of expression results to discard in blocks!

A quick look at AngryBots gives the following as the block top level expression operators that are discarded. So `set_local` and store variants that do not return a value would address much of this issue.

A post-order decoder could then compress the stack of empty value results, recording just a count of how many there are, which should mostly address this problem. For example, if there is a block of contiguous expressions returning zero values then it would have just a single count entry for this on the stack, and if there were some sparse expressions returning values then it should still compress well.
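As a sketch of that compression (names illustrative), the decoder's value stack can run-length encode contiguous zero-value results:

```python
# Illustrative decoder value stack that run-length encodes runs of
# zero-value ("void") results instead of pushing one entry per statement.

class CompressedValueStack:
    def __init__(self):
        self.entries = []            # ("value", v) or ("voids", count)

    def push_value(self, v):
        self.entries.append(("value", v))

    def push_void(self):
        if self.entries and self.entries[-1][0] == "voids":
            # Extend the current run of void results.
            self.entries[-1] = ("voids", self.entries[-1][1] + 1)
        else:
            self.entries.append(("voids", 1))

# A block of 1000 void statements costs one stack entry, not 1000:
s = CompressedValueStack()
for _ in range(1000):
    s.push_void()
assert s.entries == [("voids", 1000)]
```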
It seems to be becoming clear that expressions in wasm have limits, and perhaps there is no need to pass values through `set_local` or the store operators; perhaps it is reasonable to expect wasm code to use local variables for these use cases.

This change might also set up wasm to consider register-based operators too, that write to a register and return no values, and that could be used in abundance as top-level block statements.
What do people think?