WebAssembly / design


Augment load/store instructions w/ index and scale #1277

Open binji opened 5 years ago

binji commented 5 years ago

@aardappel and I were discussing this briefly today:

If you were to compile the following array access:

uint32_t array[100];
array[x]

in current wasm, you have to generate something like this:

0000: 20 00    ; local.get $array
0002: 20 01    ; local.get $x
0004: 41 02    ; i32.const 2
0006: 74       ; i32.shl
0007: 6a       ; i32.add
0008: 28 02 00 ; i32.load align=2 offset=0
000b:

Since this is such a common operation, it would be nice to have a dedicated instruction for it. We could encode an immediate scale value using currently-unused alignment bits, since the current spec requires that the alignment value be no larger than the natural alignment for the access width.

So for i32.load, we could encode the new alignment immediate as follows:

...000sssssaa

Where aa is the alignment value and sssss provides a constant shift value in the range [0, 32).

This scale is not very useful without a base value, so providing a non-zero sssss value would change the stack signature of the load from [base i32] -> [i32] to [base i32, index i32] -> [i32].

(We may instead want the effective shift to be sssss - 1, both so we can express base + index*1 and because a shift of 31 is not very valuable.)
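
For concreteness, a decoder might unpack the extended immediate like this (a sketch assuming the bit layout above; the names are mine):

/* imm is the load's LEB-decoded alignment immediate. */
uint32_t align       = imm & 0x3;          /* aa: log2 alignment, as in MVP */
uint32_t scale_shift = (imm >> 2) & 0x1F;  /* sssss: 0 keeps MVP behavior   */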

With this new instruction, we can generate a nicer instruction sequence:

0000: 20 00    ; local.get $array
0002: 20 01    ; local.get $x
0004: 28 0a 00 ; i32.load scale=2 align=2 offset=0
0007:

saving 4 bytes. It has some other nice benefits too: it should be easier for baseline compilers to optimize this instruction. We can also require the address calculation to be non-wrapping, which the previous instruction sequence can't guarantee (and would be difficult to provide).
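
To illustrate the non-wrapping requirement, here is a sketch (my assumptions, not spec text) of how an engine might compute the effective address in 64 bits, where it cannot wrap for 32-bit operands and a shift below 32:

#include <stdint.h>

uint64_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale_shift, uint32_t offset) {
    /* The 64-bit sum cannot overflow for these operand ranges, so an
       out-of-bounds result can simply trap instead of wrapping. */
    return (uint64_t)base + ((uint64_t)index << scale_shift) + (uint64_t)offset;
}

/* trap if effective_address(...) + access_size > memory_size */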

Thoughts?

aardappel commented 5 years ago

I was recently looking at dumps of some array-heavy code (output of LLVM -O3 and Binaryen) and it was dominated by the above code pattern, doing the same shift/add again, and again, and again...

Besides code size benefits, there are benefits for baseline-compiling engines (detecting the above pattern and folding the shift into an x86 or ARM addressing mode is not trivial, whereas doing so for the proposed load is). Then there are minor benefits for producers, and for people reading wasm :)

Having the stack signature change based on an immediate is not ideal, but then again I think adding a parallel set of load instructions is probably worse.

Another common pattern worth thinking about: local.get -> i32.const -> i32.add -> local.set. This is again super common in code I have seen, and could probably be made into a single instruction as well.

binji commented 5 years ago

After some more discussion:

rather than encoding the scale as a shift (which can only represent powers of 2), you can encode a value using a format similar to IEEE 754 floats:

...000eeessaa

Where ss is the significand and eee is the exponent. The encoding ss=0, eee=0 cannot be used for a scale, since it is reserved to keep the same behavior as the MVP.

We can borrow from the IEEE 754 spec and treat eee=0 as the raw significand ss. All other exponents have an implicit leading 1 on the significand, and bias the exponent by -1.

You can calculate these values using the C expression:

(eee == 0) ? ss : ((4 | ss) << (eee - 1))

This creates the following scale values:

1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, 448
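
As a sanity check, this small program enumerates the table above from that expression (illustrative only):

#include <stdio.h>

int main(void) {
    /* Enumerate every encodable eee:ss pair; eee=0, ss=0 is reserved. */
    for (unsigned eee = 0; eee < 8; ++eee)
        for (unsigned ss = 0; ss < 4; ++ss) {
            if (eee == 0 && ss == 0) continue;  /* MVP behavior */
            printf("%u ", eee == 0 ? ss : (4 | ss) << (eee - 1));
        }
    return 0;  /* prints 1 2 3 4 5 6 7 8 10 12 ... 448 */
}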

lukewagner commented 5 years ago

Glad to have someone look into this! One important thing to consider is the performance on a 64-bit implementation using signal-handling to perform bounds checking:

Currently, one can reserve (4gb + K + ε) guard pages so that a wasm i32.load offset=K can be compiled to a single load instruction. With scale, a VM would have to reserve much more vmem to achieve the same, and practical OS vmem quota limits will probably prevent many engines from reserving enough even for scale=2.
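
For intuition, the required reservation grows roughly linearly with the scale (my arithmetic, treating scale as a multiplier; not from the thread):

#include <stdint.h>

/* Worst case for a guard-region scheme: base and index are arbitrary
   32-bit values, so the effective address can reach
   (2^32 - 1) + (2^32 - 1) * scale + K, i.e. ~ (scale + 1) * 4 GiB + K.
   scale=2 thus needs ~12 GiB of reservation instead of ~4 GiB. */
uint64_t max_guard_reservation(uint64_t scale, uint64_t K) {
    const uint64_t max32 = 0xFFFFFFFFull;
    return max32 + max32 * scale + K;
}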

So this means we need to consider the performance on trapping systems that don't have enough vmem to handle scale=2 and scale=4. While the current wasm that gets emitted isn't pretty, the wrapping semantics are, surprisingly, rather useful for perf.

I haven't studied the problem enough to know what the net effect of non-wrapping semantics would be (well-predicted branches are cheap; hoisting is still somewhat possible; etc.), but it's at least conceivable that scale could actually be a slowdown, so I'd request that someone do a careful measurement on a benchmark suite (comparing the status quo to a compiled version that aggressively uses this extension, running on a prototype implementation).

binji commented 5 years ago

Yeah, that's a good point. I thought there was a reason we didn't do this way back when :-)

We could consider specifying wrapping behavior for this -- though it would be pretty weird to have different behavior from regular memory accesses.

OTOH, since OOB access is defined to trap, we could decide to loosen behavior to say that all memory accesses have wrapping behavior. That's a much more invasive change, of course.

lukewagner commented 5 years ago

Wrapping or non-determinism definitely sounds worse, and may also be something we'd regret in the future: my hope is that one day, in the far future, we'll get to use something more like x86 segmentation (where [base, limit] registers are checked after the effective-address computation), which would avoid the need for vmem reservations altogether (and also give wasm full use of (base + index*scale + disp) addressing).

binji commented 5 years ago

Wrapping or non-determinism definitely sounds worse

Agreed, though it's also a bit unsatisfying to limit the functionality of wasm based on a specific optimization that some implementations might never use.

One alternative that came up w.r.t. wasm64 is using a separate mprotected region to perform the "bounds check". Assuming a 4k page size, you could reduce virtual memory requirements by a factor of 16. We could also add large wasm pages, and require their use for this feature, to potentially save more here. Not sure that's worth the extra load on every memory access (and the additional register pressure), though.
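
A sketch of that idea under my own assumptions (64 KiB wasm pages over 4 KiB OS pages, which is where the factor of 16 comes from): every access performs an extra probe load into a small mirror region whose pages are unmapped beyond the current memory size.

#include <stdint.h>

/* Hypothetical: a 2^32 / 16 = 256 MiB reservation instead of ~4 GiB.
   Pages past the current memory size are PROT_NONE. */
static volatile uint8_t *check_region;

static inline void bounds_check(uint32_t addr) {
    /* Each 64 KiB wasm page maps to one 4 KiB OS page in the mirror;
       a faulting probe signals an out-of-bounds access. */
    (void)check_region[addr >> 4];
}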

PaulBone commented 5 years ago

Something some Forth implementations do is strength reduction in peephole optimisation: they slide a window a few instructions wide (a peephole) over the incoming instruction stream and replace sequences like "LIT 1 ADD" with "INC" (not a standard Forth word). Generally it's good not to discard information during compilation, but for these kinds of patterns there are some advantages in leaving it to a peephole optimisation at load time (see the sketch below).

Not saying this is right, just some points for keeping it more RISC-like.
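
A minimal sketch of such a load-time peephole (illustrative C; the opcode names and window logic are made up):

#include <stddef.h>

/* Hypothetical opcodes; OP_INC is the fused form, not a standard word. */
enum Op { OP_LIT, OP_ADD, OP_INC };
struct Insn { enum Op op; int imm; };

/* Slide a two-instruction window over the stream and fuse "LIT 1 ADD"
   into "INC"; returns how many input instructions were consumed. */
static size_t peephole(struct Insn *w, size_t n) {
    if (n >= 2 && w[0].op == OP_LIT && w[0].imm == 1 && w[1].op == OP_ADD) {
        w[0].op = OP_INC;
        return 2;  /* two instructions in, one (w[0]) out */
    }
    return 1;      /* no match; keep w[0] unchanged */
}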

eira-fransham commented 5 years ago

Streaming compilers can generate good code for this by using thunks, which are a more general-purpose way to emit fused instructions (for example, using lea for FMA), do better DCE, and improve register allocation (for example, i32.mul; return can use the 3-argument mul form with rax as the output register). Basically, this means keeping a value representing an unmaterialised operation on the virtual stack instead of emitting the operation immediately. No streaming compiler does this yet; it's on the roadmap for Lightbeam, but it requires rewriting the value-tracking system into more of a traditional SSA setup.
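
A rough sketch of the idea (types and names are mine, not Lightbeam's): the virtual stack carries descriptions of pending operations, and the backend materialises them lazily, e.g. folding a pending scaled add into a single lea or memory operand.

#include <stdint.h>

/* Virtual-stack slot for a streaming compiler (illustrative only). */
typedef enum { VAL_CONST, VAL_REG, VAL_LAZY_SCALED_ADD } ValKind;

typedef struct {
    ValKind  kind;
    int32_t  imm;        /* VAL_CONST                        */
    int      reg;        /* VAL_REG, or base of the lazy add */
    int      index_reg;  /* VAL_LAZY_SCALED_ADD only         */
    unsigned shift;      /* VAL_LAZY_SCALED_ADD only         */
} Val;

/* A later i32.load that pops a VAL_LAZY_SCALED_ADD slot can emit a
   single mov reg, [base + index*scale] instead of shl + add + load. */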

I support this, since it means that pretty much any memory access requires fewer instructions. But any baseline compiler that's serious about performance can handle the existing pattern, so this doesn't strictly increase the expressiveness or optimisability of Wasm; it just makes it easier to emit better code for this specific case.

lars-t-hansen commented 5 years ago

Instead of overloading the load instruction with ad-hoc semantics we could go half-way and introduce an address calculation instruction that performs the necessary operation:

local.get $array
local.get $x
i32.addr scale=2  // $array + $x * 2, wrapped

Furthermore, it seems that i32.addr could optionally take an offset parameter in the manner of i32.load, and could also optionally take a mask to apply to the result so as to clear the low bits:

local.get $array
local.get $x
i32.addr scale=2 offset=48 mask=3  // ($array + $x * 2 + 48) & ~3
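
In C terms, my reading of the examples above (with wrapping 32-bit arithmetic throughout) would be:

#include <stdint.h>

/* Hypothetical semantics of i32.addr scale=S offset=O mask=M; unsigned
   arithmetic wraps mod 2^32, and the mask clears low result bits. */
uint32_t i32_addr(uint32_t base, uint32_t index,
                  uint32_t scale, uint32_t offset, uint32_t mask) {
    return (base + index * scale + offset) & ~mask;
}

/* e.g. i32_addr(array, x, 2, 48, 3) == (array + x*2 + 48) & ~3 */
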
eira-fransham commented 5 years ago

If you're doing that, why not just have an FMA instruction that takes 3 params from the stack? It doesn't help performance on even the simplest of compilers to enforce that the scale is static.

lars-t-hansen commented 5 years ago

If you're doing that, why not just have an FMA instruction that takes 3 params from the stack? It doesn't help performance on even the simplest of compilers to enforce that the scale is static.

"The simplest of compilers" will not even try to track operand values and may be helped by having the scale in the instruction instead of having to discover that the value is constant. And then there's the option of folding more operations that are typical to address computations into the instruction, if we think compilers can make use of them (open question), so FMA may not be what you want in that case.

Anyway, the original motivation seems to be compactness of representation more than anything. It would be useful to see some plausible data on how much we'd actually save with the various encoding options.

binji commented 5 years ago

Instead of overloading the load instruction with ad-hoc semantics we could go half-way and introduce an address calculation instruction that performs the necessary operation:

I like this idea; the wrapping behavior is natural and expected in this case.

Anyway, the original motivation seems to be compactness of representation more than anything. It would be useful to see some plausible data on how much we'd actually save with the various encoding options.

Yes, Wouter and I originally discussed it because he was seeing this operation a lot in his generated code. But it would be interesting to see how often it occurs in larger modules too. I started looking at a few examples the slow way (grepping and eyeballing the surrounding instructions), and found that it's common, but not overly so. Often the scale is lowered to an add on each iteration of a loop, as you'd expect.

binji commented 5 years ago

I wrote a tool to read through all the instructions in a wasm binary and look for common instruction sequences. It normalizes all indexes and all memory offsets to 0, so different locals and offsets count as the same sequence. I took some wasm files that I had locally and looked for instruction sequences like i32.const x; i32.shl; i32.add or i32.const x; i32.mul; i32.add. The percentage is calculated as 100 * sequence length * count / total instructions.
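
For reference, the percentages below follow this formula (an illustrative helper, not part of the tool):

/* percent = 100 * sequence length * count / total instructions */
double percent(unsigned seq_len, unsigned count, unsigned total) {
    return 100.0 * seq_len * count / total;
}

/* e.g. percent(3, 19532, 5119121) ~= 1.14, the first tanks row below. */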

Even so, you can see that the equivalent of the i32.addr instruction is fairly common. If we assume a 2-byte encoding, we could save ~100K on tanks.wasm (a 12M file).

tanks (5,119,121 total instructions):

count seq. len. sequence percent
19532 3 [i32.const 2 i32.shl i32.add] 1.14%
8943 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.70%
5837 3 [i32.const 3 i32.shl i32.add] 0.34%
5384 3 [i32.const 12 i32.mul i32.add] 0.32%
3489 3 [i32.const 4 i32.shl i32.add] 0.20%
3129 3 [i32.const 24 i32.mul i32.add] 0.18%
2878 3 [i32.const 6 i32.shl i32.add] 0.17%
2398 3 [i32.const 20 i32.mul i32.add] 0.14%
2044 3 [i32.const 5 i32.shl i32.add] 0.12%
1900 3 [i32.const 40 i32.mul i32.add] 0.11%
1344 3 [i32.const 48 i32.mul i32.add] 0.08%

banana bread (924,654 total instructions)

count seq. len. sequence percent
3312 3 [i32.const 2 i32.shl i32.add] 1.07%
1804 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.78%
1233 3 [i32.const 12 i32.mul i32.add] 0.40%
738 3 [i32.const 5 i32.shl i32.add] 0.24%
723 3 [i32.const 3 i32.shl i32.add] 0.23%
595 3 [i32.const 40 i32.mul i32.add] 0.19%
512 3 [i32.const 1 i32.shl i32.add] 0.17%
478 3 [i32.const 4 i32.shl i32.add] 0.16%
441 3 [i32.const 24 i32.mul i32.add] 0.14%
417 3 [i32.const 36 i32.mul i32.add] 0.14%
351 3 [i32.const 48 i32.mul i32.add] 0.11%

ammo (bullet physics) (312,299 total instructions)

count seq. len. sequence percent
633 3 [i32.const 2 i32.shl i32.add] 0.61%
248 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.32%
190 3 [i32.const 4 i32.shl i32.add] 0.18%

doom3 (1,953,580 total instructions)

count seq. len. sequence percent
4477 4 [i32.const 2 i32.shl local.get 0 i32.add] 0.92%
3732 3 [i32.const 2 i32.shl i32.add] 0.57%
1782 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.36%
1533 5 [i32.const 2 i32.shl local.get 0 i32.add i32.load {align 2, offset 0}] 0.39%
1466 5 [i32.const 2 i32.shl i32.const 89312 i32.add i32.load {align 2, offset 0}] 0.38%
953 5 [i32.const 2 i32.shl local.get 0 i32.add f32.load {align 2, offset 0}] 0.24%

pspdfkit (3,417,737 total instructions)

count seq. len. sequence percent
8116 3 [i32.const 2 i32.shl i32.add] 0.71%
3108 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.36%
1559 3 [i32.const 3 i32.shl i32.add] 0.14%
1504 4 [i32.const 2 i32.shl i32.add i32.store {align 2, offset 0}] 0.18%
1200 3 [i32.const 1 i32.shl i32.add] 0.11%

webp encoder (73,089 total instructions)

count seq. len. sequence percent
480 4 [i32.const 2 i32.shl local.get 0 i32.add] 2.63%
228 5 [i32.const 2 i32.shl local.get 0 i32.add i32.load {align 2, offset 0}] 1.56%
153 4 [i32.const 1 i32.shl local.get 0 i32.add] 0.84%
128 3 [i32.const 2 i32.shl i32.add] 0.53%
123 3 [i32.const 1 i32.shl i32.add] 0.50%
58 3 [i32.const 744 i32.mul i32.add] 0.24%
41 5 [i32.const 1 i32.shl local.get 0 i32.add i32.load16_u {align 1, offset 0}] 0.28%
36 3 [i32.const 5 i32.shl i32.add] 0.15%
34 3 [i32.const 3 i32.mul i32.add] 0.14%
31 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.17%

mozjpeg encoder (77,684 total instructions)

count seq. len. sequence percent
718 3 [i32.const 2 i32.shl i32.add] 2.77%
442 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 2.28%
71 3 [i32.const 1 i32.shl i32.add] 0.27%
53 5 [i32.const 2 i32.shl i32.add local.tee 0 i32.load {align 2, offset 0}] 0.34%
50 3 [i32.const 7 i32.shl i32.add] 0.19%

optipng encoder (119,159 total instructions)

count seq. len. sequence percent
236 3 [i32.const 2 i32.shl i32.add] 0.59%
124 3 [i32.const 1 i32.shl i32.add] 0.31%
92 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.31%
65 3 [i32.const 3 i32.mul i32.add] 0.16%
43 3 [i32.const 28 i32.mul i32.add] 0.11%
37 4 [i32.const 1 i32.shl i32.add i32.load16_u {align 1, offset 0}] 0.12%

UE4 demo (15,994,748 total instructions)

count seq. len. sequence percent
51468 3 [i32.const 2 i32.shl i32.add] 0.97%
25309 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.63%
13386 3 [i32.const 4 i32.shl i32.add] 0.25%
13141 3 [i32.const 12 i32.mul i32.add] 0.25%
12941 3 [i32.const 3 i32.shl i32.add] 0.24%
7735 3 [i32.const 6 i32.shl i32.add] 0.15%
7491 3 [i32.const 24 i32.mul i32.add] 0.14%
7452 3 [i32.const 36 i32.mul i32.add] 0.14%
6836 3 [i32.const 5 i32.shl i32.add] 0.13%
6258 3 [i32.const 20 i32.mul i32.add] 0.12%
5945 3 [i32.const 40 i32.mul i32.add] 0.11%
5922 3 [i32.const 48 i32.mul i32.add] 0.11%

Pyodide (3,812,914 total instructions)

count seq. len. sequence percent
4104 4 [i32.const 2 i32.shl local.get 0 i32.add] 0.43%
3715 3 [i32.const 2 i32.shl i32.add] 0.29%
2222 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.23%
1719 5 [i32.const 2 i32.shl local.get 0 i32.add i32.load {align 2, offset 0}] 0.23%
964 3 [i32.const 1 i32.shl i32.add] 0.08%
940 3 [i32.const 3 i32.shl i32.add] 0.07%
759 3 [i32.const 4 i32.shl i32.add] 0.06%
aardappel commented 5 years ago

That is great data. It certainly speaks for an addr operation, and maybe a load. What is surprising is that we have so many cases where it is NOT followed by a load... it looks like it is common to compute an address first and only use it later. I also suspect there are other common patterns between this and the load, like an additional non-constant offset.

So yeah, i32.addr sounds pretty pragmatic, based on this data.

binji commented 5 years ago

Chatting w/ @aardappel offline, we realized that a lot of these modules are older and were optimized differently. Here's some data from newer modules. The distributions look similar, though the sequences are perhaps less common (likely because newer modules seem to move the constant base closer to the add; see the x86 emulator below for an example).

edit: actually, it seems that Godot is pretty old too (2017).

Godot engine (6,641,590 total instructions)

count seq. len. sequence percent
11074 3 [i32.const 2 i32.shl i32.add] 0.50%
4422 4 [i32.const 2 i32.shl i32.add i32.load {align 2, offset 0}] 0.27%
4623 3 [i32.const 24 i32.mul i32.add] 0.21%
4508 3 [i32.const 3 i32.shl i32.add] 0.20%
2649 3 [i32.const 4 i32.shl i32.add] 0.12%

x86 emulator (163,474 total instructions)

count seq. len. sequence percent
285 5 [i32.const 2 i32.shl i32.const 23628 i32.add i32.load {align 2, offset 0}] 0.87%
253 5 [i32.const 2 i32.shl i32.const 8412236 i32.add i32.load {align 2, offset 0}] 0.77%
306 4 [i32.const 2 i32.shl i32.const 23628 i32.add] 0.75%
301 4 [i32.const 2 i32.shl i32.const 8412236 i32.add] 0.74%
214 5 [i32.add i32.const 2 i32.shl i32.const 8412236 i32.add] 0.65%
194 5 [i32.add i32.const 2 i32.shl i32.const 23628 i32.add] 0.59%
239 4 [i32.const 2 i32.shl i32.const 23072 i32.add] 0.58%
177 5 [i32.and i32.const 2 i32.shl i32.const 23072 i32.add] 0.54%
98 5 [i32.const 2 i32.shl i32.const 23072 i32.add i32.load {align 2, offset 0}] 0.30%
88 4 [i32.const 4 i32.shl i32.const 23152 i32.add] 0.22%
95 3 [i32.const 2 i32.shl i32.add] 0.17%

makepad (345,687 total instructions)

count seq. len. sequence percent
206 3 [i32.const 2 i32.shl i32.add] 0.18%
85 3 [i32.const 12 i32.mul i32.add] 0.07%

sandspiel (33,841 total instructions)

count seq. len. sequence percent
151 4 [i32.const 2 i32.shl local.tee 0 i32.add] 1.78%
103 3 [i32.const 2 i32.shl i32.add] 0.91%
75 4 [i32.const 2 i32.shl i32.add i32.load8_u {align 0, offset 0}] 0.89%
41 5 [i32.const 2 i32.shl local.tee 0 i32.add i32.const 0] 0.61%
26 3 [i32.const 12 i32.mul i32.add] 0.23%
12 5 [i32.const 2 i32.shl local.tee 0 i32.const 1048788 i32.add] 0.18%
14 4 [i32.const 2 i32.shl i32.add i32.load {align 0, offset 0}] 0.17%
PaulBone commented 5 years ago

Are these all generated from the same original language & toolchain? We may find that this changes as more languages and toolchains generate wasm.

binji commented 5 years ago

Yes, primarily fastcomp (asm.js) + asm2wasm, or LLVM + Binaryen. Agreed, it would be good to look at other toolchains; do you have some good examples I should try?

binji commented 5 years ago

@kronicdeth here's the tool I was using to generate this data: https://github.com/binji/wasp#wasp-pattern-examples

PaulBone commented 5 years ago

@binji I don't have any myself, sorry. Just a general wish that there should be more. (mumbles something about FP and other paradigms targeting WASM sometime)