I'm not sure I understand the data you're quoting when you say "the block top level expression operators that are discarded":

    set_local 236239 50.94

Is that 236239 total uses of `set_local`, at 50.94% of total instructions? What percentage of `set_local` are discarded?

I'm also not sure the binary was well optimized. Are there missing opportunities? We can't decide to remove the feature if we think we haven't made a proper effort to optimize it.
@jfbastien Walked the wast, and at a `block` walked the top level expressions and noted the immediate operation (the results of which are discarded), and counted them. For example `(block (set_local ...) (i32.store ...) (set_local ...) (call ...) (i32.store ...))`. So there were 236239 `set_local` operations in the block top level expression position, the results of which were discarded. Some of the `call` and other operators in the top level position might have returned zero values too, and then the decoder could have accounted for them in a compressed count as well, but this was not checked. This does not seem to be a 'missing' optimization issue, just natural blocks of effective statements. This is just my input into the ongoing discussion, and I am interested in what other solutions people have, but I think this solution has some merit.
This issue also affects one-pass compilation from wasm to machine code, because every `set_local`, `store`, and also `call` and friends can produce a value whether it's wanted or not, so a single-pass consumer has to assume it's needed and allocate storage for it.
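To make the concern concrete, here is a minimal Python sketch (the opcode names and the `(op, *args)` stream format are mine for illustration, not the actual binary encoding): in a postorder stream the consumer reaches `set_local` before anything has revealed whether its result is consumed, so it has to reserve a slot regardless.

```python
# Hypothetical single-pass consumer of a postorder opcode stream.
# Opcode names and the (op, *args) tuple format are illustrative only.

def consume(postorder_ops):
    value_stack = []  # each entry costs a register or a stack spill slot
    for op, *args in postorder_ops:
        if op == "get_local":
            value_stack.append(("local", args[0]))
        elif op == "i32.add":
            rhs, lhs = value_stack.pop(), value_stack.pop()
            value_stack.append(("add", lhs, rhs))
        elif op == "set_local":
            val = value_stack.pop()
            # set_local also *produces* the stored value; at this point the
            # consumer cannot tell whether anything downstream uses it, so
            # it must keep a slot alive for it just in case.
            value_stack.append(("set_result", args[0], val))
    # Whatever remains at block end was dead all along, but storage was
    # already reserved for it during the single pass.
    return value_stack

# One statement, result unused: the slot is held until block end anyway.
leftover = consume([("get_local", 1), ("get_local", 2),
                    ("i32.add",), ("set_local", 0)])
assert leftover == [("set_result", 0, ("add", ("local", 1), ("local", 2)))]
```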
I agree that it's worth considering void-returning `set_local`, `store`, and also variants of `call` and friends that return no value even when the function being called has a return value. With these operators, it would be practical to avoid ever having any dead values on the decoding stack.
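For illustration, a sketch of how a hypothetical void form (the name `set_local_void` below is made up for this example) would interact with the same consumer: it pops its operand and pushes nothing, so a dead value never reaches the stack.

```python
# Same illustrative consumer as above, extended with a hypothetical
# void-returning form that is defined to produce no value.

def consume_with_void(postorder_ops):
    value_stack = []
    for op, *args in postorder_ops:
        if op == "get_local":
            value_stack.append(("local", args[0]))
        elif op == "i32.add":
            rhs, lhs = value_stack.pop(), value_stack.pop()
            value_stack.append(("add", lhs, rhs))
        elif op == "set_local":          # value-producing form, as before
            value_stack.append(("set_result", args[0], value_stack.pop()))
        elif op == "set_local_void":     # hypothetical void form
            value_stack.pop()            # consume the value, push nothing
    return value_stack

# The same statement encoded with the void form leaves nothing behind:
assert consume_with_void([("get_local", 1), ("get_local", 2),
                          ("i32.add",), ("set_local_void", 0)]) == []
```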
@sunfishcode: I don't understand the storage overhead you mention. Isn't it the case that e.g. a `store` return value can only be ignored when in a block, for example? In that case, we need to allocate room for it in the block, just like every other instruction - what is being wasted here? The only not immediately obvious case seems to be a `store` at the end of a block (we need to look lower to see if it's actually going to be consumed), but that's very rare I think.
@kripken With the post-order encoding a single-pass compiler does not know if the result is used until the end of the block, so it would need to keep around all results of every `store` and `set_local`. Could add `block` to this list or a `block0` variant, and also bring forward the `call_multiple` variant that writes results to local variables so that it does not return expression results. Also bring forward the `br0` and `br0_if` variants just for encoding efficiency. Could consider an `if0` too, but I guess it is unnecessary anyway and producers could use `block0` `br0_if` `br0` for diamonds. btw: the expression-less encoding I am exploring is now beating the current encoding in terms of encoded file size, and the changes in this issue move in that direction.
By "allocate storage", I mean in terms of code generation -- it has to allocate a register or some stack space to hold the value. Also, it appears that I neglected to mention that this concern is specific to postorder encoding. In postorder, a single-pass consumer looking at a particular operation can't tell if that operation's result will be used.
This isn't important for operators with no side effects, because reasonable producers shouldn't leave behind obviously dead code, but currently it's unavoidable in `set_local`, `store`, `call`, and friends.
Ok, thanks, now I see - both of you are concerned with something like a baseline compiler.
This seems like a valid concern; however, the perf loss in baseline vs the downsides of splitting e.g. `set_local` into `set_local` and `set_local_noreturn` would need to be compared.
I don't know what the downsides are of having the extra opcode (spec complexity sure, because as @JSStats points out there are probably several opcodes that need this split, but anything else?) but the baseline compiler is doing strictly more work both at compile time (register / stack management) and run time (register shuffling / memory traffic) without the information. That said, I think it's a fruitful question whether an efficient one-pass compiler needs to be supported or if a 1.5-pass compiler (that builds partial trees until it has enough information to discard them) is an acceptable trade-off for good code.
We just need data on this of course, but the main downside that concerns me is splitting the super-common set_local node. If right now it has opcode X with super-high frequency, then if we split we'd have X1, X2, with lower frequencies, which might be bad for compression.
edit: get_local wasn't relevant
Another alternative would be to add a `discard` operator. This way we could keep `set_local` and `get_local` as they are.
True. But then we'd have X, DX (if "D" is the discard prefix code), which I suspect is also less efficient than X.
I think the core issue is that this information is already present in the AST, so splitting the opcode or adding a discard prefix opcode is redundant information that we just add for a single-pass compiler.
I got some quick data now. I measured what I think is the worst case, where half of set_locals get the new opcode or a prefix opcode. The result on BananaBread is a 5.2% size increase after gzip with splitting, and 6.7% with adding a discard prefix code.
How did you pick which set_locals get the new opcode?
Just the simplest silly cycling between the two.
It's possible that silly cycling could be biased against compression though. Binaryen knows which set_locals have an unused result, so would it be possible to use that instead?
@JSStats I'm seeing a different result than you are. I just instrumented the sexpr-wasm interpreter to see how many `set_local` values are discarded in AngryBots, and found that 90.39% of them are. (I also see 580108 set_locals, not 236239 / 0.5094 == 463759.)

edit: oh, I think I misread the numbers. You were calculating it as a percentage of total discarded values? Still, I'm wondering why I'm seeing so many more discarded `set_local` values than you are.
@sunfishcode: That usage info is not trivially available in the AST, which is why I grabbed the quickest data first. But, I think it's also a good worst-case estimate, both because cycling keeps them mixed, and because a 50-50 distribution maximizes the entropy.
I also meanwhile found a way to use the usage information in wasm-as. With that, I see a 1.1% size increase on BB and AB.
So we are probably looking at something around a 1.1% gzip size increase, and very probably less than a 5.2% increase, in return for some amount of benefit in single-pass compilers. I guess the question is then, how much of a benefit in them, and how much do they matter?
@binji: Which AB wast is that on? Current binaryen can "sink" set_locals when allowable (this found some places where the wasm backend wasn't doing all it could). As a result, older binaryen has few uses of set_local values, but newer binaryen has more. (To get the optimized version, you can do something like `binaryen-shell AB.wast -O --print &> AB_new.wast`; I'd update the build-suite, but I assume some people want the same one for consistency.)
@kripken Yes, I was using the file from the build suite. OK, it's much better now, looks like 41.95% of `set_local` values are discarded (242251 / 577477).
In case it helps the discussion, here is a version of AngryBots wasm that has been flattened into single statements, with no `if` or `if_else` either. Some immediate constants are allowed; they map to opcodes accepting immediate constants in the encoding I am exploring, and this helps the encoded size.

The uncompressed size is 40% larger, and the brotli compressed size is 26% larger. But adding operator variations that accept immediate locals and some immediate constants addresses much of this bloat, bringing it to 8% smaller uncompressed and 13% larger compressed.
Further, a simple local variable predictor is relatively successful, with around a 50% hit ratio, and seems to further address the bloat. The predictor tracks the index of the last local variable written of each type during encoding. Arguments are assumed to read the last local written and to be single-use, so after reading each argument the predicted index decreases; they are assumed to be read in reverse order, as if being popped from a stack held in the local variables. Operator results are predicted to write to the next local, and the predicted index is reset with each write. If the locals are allocated in an expression-stack pattern then the hit ratio is very high for this predictor. It can be exploited easily by encoding a hit as zero and leaving it to the compressor to exploit, and there might be some merit in special opcodes that assume a hit (around 25% of operators in AngryBots).
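A hedged sketch of that predictor as I read the description above, for a single value type (a full version would keep one instance per type; all names here are illustrative):

```python
# Sketch of the local-variable predictor described above, for one value
# type; a full version would keep an instance per type (i32, i64, ...).

class LocalPredictor:
    def __init__(self):
        self.last_written = None   # index of the last local written

    def predict_read(self):
        # Arguments are assumed to be the most recently written locals,
        # single-use, read in reverse order (like pops from a stack).
        return self.last_written

    def observe_read(self, actual_index):
        hit = actual_index == self.last_written
        if self.last_written is not None:
            self.last_written -= 1   # single use: next read is one lower
        return hit

    def predict_write(self):
        # Results are predicted to land in the next local.
        return None if self.last_written is None else self.last_written + 1

    def observe_write(self, actual_index):
        hit = actual_index == self.predict_write()
        self.last_written = actual_index   # reset on every write
        return hit

# Example: locals allocated in an expression-stack pattern predict well.
p = LocalPredictor()
p.observe_write(4)            # result written to local 4
assert p.observe_write(5)     # next result lands in local 5: hit
assert p.observe_read(5)      # arguments read back in reverse order: hit
assert p.observe_read(4)      # ...stepping down the "stack": hit
```

On a hit the encoder could emit a zero, or use a dedicated assume-hit opcode, and leave the rest to the general-purpose compressor, as described above.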
Some of the bloat is due to flattening `if` and `if_else` to `block` `br_if` and `br`. It's not so bad keeping `if` and `if_else`: 34% larger uncompressed and 23% larger compressed. There is probably some alternative representation that could address this too. Using operators with immediate local variables and some immediate constants, plus prediction of local variables, and keeping `if` and `if_else` for control flow only, the experimental encoding is giving an 8% smaller uncompressed file size and a 6% smaller brotli compressed file size (unverified).
I'll try to clean up the encoder and write a decoder that losslessly encodes and decodes the above wasm file, to check and prove the validity of an encoding, but you can see from this wasm file that the prospects for an expression-less encoding are good. It might still remain to be shown that a decoder is fast, and that asm.js can be converted to this easily.
Just want to reiterate that this solution does not require removing the empty/void type. The results of `nop` or a `discard` can still be consumed by operators that accept zero values, so they still need to be represented on the expression stack until the end of a block. The solution proposed is to compress these and bias the top-level expressions towards producing mainly zero-values. Ignoring a type does not make it go away!
Discarding results is looking like a big issue with the post-order encoding. Looking into a register-based encoding (or macros) suggests this would be an extreme problem; for example, if expressions were flattened into macros that expanded into, say, `(set_local $l1 (i32.add (get_local $i2) (get_local $l3)))` etc., then there will be a lot of expression results to discard in blocks!

A quick look at AngryBots gives the following as the block top level expression operators that are discarded. So `set_local` and store variants that do not return a value would address much of this issue.

A post-order decoder could then compress the stack of empty value results, recording just a count of how many there are, which should mostly address this problem. For example, if there is a block of contiguous expressions returning zero values then it would have just a single count entry for this on the stack, and if there were some sparse expressions returning values then it should still compress well.
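As a sketch of that compression (names illustrative), the decoder's value stack can run-length encode contiguous zero-value results:

```python
# Illustrative decoder value stack that run-length encodes runs of
# zero-value ("void") results instead of pushing one entry per statement.

class CompressedValueStack:
    def __init__(self):
        self.entries = []            # ("value", v) or ("voids", count)

    def push_value(self, v):
        self.entries.append(("value", v))

    def push_void(self):
        if self.entries and self.entries[-1][0] == "voids":
            # Extend the current run of void results.
            self.entries[-1] = ("voids", self.entries[-1][1] + 1)
        else:
            self.entries.append(("voids", 1))

# A block of 1000 void statements costs one stack entry, not 1000:
s = CompressedValueStack()
for _ in range(1000):
    s.push_void()
assert s.entries == [("voids", 1000)]
```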
It seems to be becoming clear that expressions in wasm have limits, and perhaps there is no need to pass values through `set_local` or the store operators; perhaps it is reasonable to expect wasm code to use local variables for these use cases.

This change might also set up wasm to consider register-based operators too, that write to a register and return no values, and that could be used in abundance as top-level block statements.
What do people think?