WebAssembly / spec

WebAssembly specification, reference interpreter, and test suite.
https://webassembly.github.io/spec/

Maximum alignment? #217

Closed jfbastien closed 8 years ago

jfbastien commented 8 years ago

What's the maximum alignment for load/store?

We should probably specify something, e.g. log2(address_space_bytes), and test for it.
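Such a check could be a simple range comparison on the encoded exponent. A hypothetical sketch (the constant names are assumptions for illustration, not spec terms):

```python
import math

# Assumed constants for a 32-bit wasm address space.
PAGE_SIZE = 64 * 1024                                # wasm page size in bytes
ADDRESS_SPACE_BYTES = 2 ** 32                        # 32-bit linear memory
MAX_LOG2_ALIGN = int(math.log2(ADDRESS_SPACE_BYTES)) # 32

def alignment_in_range(log2_align: int) -> bool:
    """Range-check an alignment immediate encoded as log2(bytes)."""
    return 0 <= log2_align <= MAX_LOG2_ALIGN
```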

mbodart commented 8 years ago

By "test for it", I assume you mean just a range check on the alignment value.

If I'm interpreting AstSemantics.md correctly, the alignment is just a performance hint, not a "trappable" guarantee. Is this a correct interpretation?

jfbastien commented 8 years ago

Correct, check it in the implementation, and have corresponding tests in the spec test suite.

Your interpretation is correct.

ghost commented 8 years ago

This came up before somewhere, and I recall some support for the page size being the maximum. If the minimum is one then this could be encoded in 4 bits. But perhaps 256 is also a reasonable limit, encoded in 3 bits, or even 16 encoded in 2 bits. The v8 encoding only has one bit for the alignment - the natural alignment flag.

kg commented 8 years ago

Don't bother worrying about bitpacking. Every stream compressor will do it for you. :-)

ghost commented 8 years ago

@kg That's not my experience. It helps the compressor to pack and avoid unnecessary redundancy. Even Brotli only looks back one or two bytes. We should decide on the limits: the lower limit seems to be one, and the upper limit at most the page size. We are only interested in powers of 2, so only encode the power of two. That gives four bits at most. We might be able to do even better given knowledge of the operation's natural alignment; Brotli only looks at 6 prior bits and might not even notice the difference between the opcodes, so it can't focus the codes.

ghost commented 8 years ago

Sorry, if the lower limit is one, that would be 2^0, so four bits only gets to 2^15, which is half of the wasm page size. The other suggestions have a similar error.
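The arithmetic in this correction can be checked mechanically (a throwaway sketch, not part of any proposed encoding):

```python
def max_alignment_bytes(bits: int) -> int:
    """n bits encode exponents 0..2**n - 1, so the largest
    representable alignment is 2**(2**n - 1) bytes."""
    return 2 ** (2 ** bits - 1)

WASM_PAGE = 2 ** 16  # 64 KiB
# Four bits top out at 2**15 bytes: half a wasm page, as noted above.
```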

kg commented 8 years ago

@kg That's not my experience. It helps the compressor to pack and avoid unnecessary redundancy. Even Brotli only looks back one or two bytes. We should decide on the limits: the lower limit seems to be one, and the upper limit at most the page size. We are only interested in powers of 2, so only encode the power of two. That gives four bits at most. We might be able to do even better given knowledge of the operation's natural alignment; Brotli only looks at 6 prior bits and might not even notice the difference between the opcodes, so it can't focus the codes.

I tested this specific question and talked to one of the designers of Brotli. Once you split streams, the alignment values will be a stream of 1 (or 4) byte values which can be arithmetic coded or otherwise represented very efficiently. If we bitpack the alignment into 3-5 bits with other unrelated data in the other bits, what we're actually doing is destroying some of the potential for the compressor to efficiently encode either chunk of data.

Encoding the alignment as a power of two is a perfectly reasonable idea, but if you cap the alignment at some arbitrary threshold so you can pack bits, you may regret it. Vectors larger than 256 bytes are an inevitability, as are user-defined structures larger than 256 bytes.

ghost commented 8 years ago

@kg I really doubt that is correct, but we can settle it in a test. Here's why: most of the alignments will be the natural alignment. In a separate stream they will jump around and there is no context, whereas inline there is the context of the operator, which could predict the natural alignment; most of the time it would be just one bit. What we might be able to do is optimize the encoding to be 0 for the natural alignment; then when packed in a separate stream it would compress very well.

kg commented 8 years ago

@kg I really doubt that is correct, but we can settle it in a test.

I tested it. It's in js-astcompressor, and you can also see some general things to this effect in the JSZap paper. If you are confused I would be happy to explain it to you in depth offline.

In a separate stream they will jump around and there is no context. Whereas inline there is the context of the operator, which could predict the natural alignment; most of the time it would be just one bit.

This does bring up an important detail. If your stream splitting is coarse-grained, like 'here is all the uint8s', you're correct that the values will be harder to predict. That's not the right way to do stream splitting (though it is, in fact, better than nothing). You want to split streams semantically, so your individual streams are things like 'opcodes', 'int8 constant immediates' and 'memory load alignments'.

ghost commented 8 years ago

@kg I think I understand your point, but I have not explored it. Here's a test.

The context helps. Another option would be to keep the one bit flag, in the v8 encoding, and have a separate stream for the non-natural cases, and in that case since most of the alignments are natural this side stream would be very small.
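The natural-alignment-relative idea could look like this (a hypothetical encoding sketch; the opcode table is an assumed subset for illustration, and it assumes alignment never exceeds the natural alignment):

```python
# Assumed natural alignment exponents (log2 bytes) per opcode.
NATURAL_LOG2 = {"i32.load": 2, "i64.load": 3, "f32.load": 2, "f64.load": 3}

def encode_align(op: str, log2_align: int) -> int:
    """Encode alignment relative to the operation's natural alignment.
    0 means naturally aligned, so a side stream of these deltas would
    be almost entirely zeros and compress very well."""
    assert log2_align <= NATURAL_LOG2[op]  # assumption: capped at natural
    return NATURAL_LOG2[op] - log2_align
```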

mbodart commented 8 years ago

Is there any data to show that natural alignment is common?

Yes, data is typically naturally aligned when allocated. But given a pointer parameter, say "int *p", a C/C++ compiler often cannot guarantee that p is well aligned as it most likely hasn't seen its allocation. Are we expecting C/C++ compilers to optimistically mark references with natural alignment, and only specify lesser alignment when that is clear to them?

Given that alignment is only a hint, its utility is not clear to me.

sunfishcode commented 8 years ago

We do expect C/C++ compilers to optimistically mark references with natural alignment, even when they haven't seen the allocation, and only specify lesser alignments when that is clear to them. Natural alignment is the overwhelming majority.

kg commented 8 years ago

@mbodart

Given that alignment is only a hint, its utility is not clear to me.

The importance of the alignment hint is largely for the polyfill, but it matters on some native architectures too: If we want to do unaligned reads/writes in asm.js we have to jump through some obscene hoops to make them work. So if we don't know the alignment status of any memory operations, we eat a performance hit on all memory traffic.

The alignment hint allows a compiler to say 'it's okay to break me if I lie about my alignment' in the common case so that you only get bad performance in the spots where it's needed.

I agree with @sunfishcode that natural alignment is probably the overwhelming majority in practice. I do think it's the case that some compilers will not trivially be able to flag loads/stores as aligned without doing a full global analysis, though...

mbodart commented 8 years ago

I think we're in agreement, as long as when you say "it's okay to break me" you really mean "it's okay to lessen my performance, but not crash".

I'm not so concerned about the MVP. But when SIMD comes along, the current alignment definition does not allow an engine to choose movaps instead of movups (on x86), as movaps would fault if the address is unaligned. This may be moot, as movups performs well on more recent CPUs when the data is aligned.

sunfishcode commented 8 years ago

I believe wasm's current alignment hints do permit implementations to use movaps for SIMD, provided they're capable of catching the hardware trap and fixing things up.

ghost commented 8 years ago

This seems to have been resolved in the binary format, which has a maximum alignment of the access's natural alignment. Or is it still open to discussion to increase this maximum?

lukewagner commented 8 years ago

Agreed.

sunfishcode commented 8 years ago

This is not yet implemented in ml-proto. For example, memory.wast tests that (module (memory 0) (func (i32.load align=8 (i32.const 0)))) validates.

dschuff commented 8 years ago

Actually, would it make sense to at least allow say 4 or 8, even for smaller-sized memory references? I could totally imagine there being plenty of e.g. byte-sized loads with alignments of 4, and I don't see any reason to pessimize those. Is there some particular reason the binary format doesn't allow that?

dschuff commented 8 years ago

(By the way I noticed that @sunfishcode committed an LLVM patch limiting the alignment on loads and stores to natural alignment. I guess that was just so that its output could be properly encoded with the 0xb format today?)

flagxor commented 8 years ago

Was size savings the only argument for capping at natural alignment? Seems a shame to throw away information we have in LLVM. Though I admit implementations probably won't get around to using it for a bit. Dan, Ben, others, is there any particular reason not to allow up to page size?

titzer commented 8 years ago

As a hint, it can't do any damage.

jfbastien commented 8 years ago

Could we allow alignment hints on load / store up to page size?

kripken commented 8 years ago

Could that be encoded in a way that doesn't increase the binary size noticeably?

jfbastien commented 8 years ago

If it's frequent enough to matter then macro-compression will pick it up. If a user is OK with dropping hints and losing potential perf, then they can be stripped away.

The hint is useful in guiding speculative optimizations by the JIT (test and recompile, otherwise main code assumes alignment is correct). You can still do that speculation without the hint, but why throw away information LLVM has?

titzer commented 8 years ago

On Mon, Apr 25, 2016 at 6:37 PM, JF Bastien notifications@github.com wrote:

If it's frequent enough to matter then macro-compression will pick it up. If a user is OK with dropping hints and losing potential perf, then they can be stripped away.

The hint is useful in guiding speculative optimizations by the JIT (test and recompile, otherwise main code assumes alignment is correct). You can still do that speculation without the hint, but why throw away information LLVM has?

The question is whether encoding page size alignment (a 3-byte LEB) is worth it, versus just specifying natural alignment.
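For reference, the size difference in question, using a standard unsigned LEB128 encoder (the variable-length integer encoding used throughout the wasm binary format):

```python
def uleb128(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))  # set high bit if more follows
        if not n:
            return bytes(out)

# A raw byte count of 65536 (page size) needs 3 LEB bytes,
# while its log2 (16) fits in a single byte.
```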


kripken commented 8 years ago

If it's frequent enough to matter then macro-compression will pick it up.

Sure, but perhaps to the detriment of other things the compression could have reduced. I think this is worth measuring.

sunfishcode commented 8 years ago

What kinds of optimizations would super-natural alignment hints enable? LLVM already does the main alignment-related optimizations, such as merging adjacent memory accesses into larger accesses.

jfbastien commented 8 years ago

What kinds of optimizations would super-natural alignment hints enable? LLVM already does the main alignment-related optimizations, such as merging adjacent memory accesses into larger accesses.

Depends on the architecture. The basic one is to speculate that accesses are aligned: vectorize or merge some operations, while avoiding the scalar entry/exit loops as well as the unaligned fallback loop. Test, and interpret on failure; recompile if it fails too often. Less code is executed dynamically and less is generated statically. That has paid off in my experience.

sunfishcode commented 8 years ago

@jfbastien LLVM already does that optimization. If it has a bunch of sequential memory accesses and it knows the alignment of the whole group, and it's not too much work to marshal the data around for a merged access, it'll merge them itself.

jfbastien commented 8 years ago

@jfbastien LLVM already does that optimization. If it has a bunch of sequential memory accesses and it knows the alignment of the whole group, and it's not too much work to marshal the data around for a merged access, it'll merge them itself.

I don't think we're talking of the same thing: I'm talking about speculative optimizations which lead to recompilation on speculation failure. One of the big costs of speculation is when the optimizer is over-eager and assumes things that aren't true, that's where the extra info would pay off.

sunfishcode commented 8 years ago

LLVM's own alignment information is never speculative in nature. At the LLVM level, it's full-octopus UB if the alignment isn't correct.

In general, wasm's alignment hint works well for these kinds of "highly confident" cases. If we change wasm's alignment hint to be used for more speculative use cases, such as humans decorating their own code, then VMs will have to be more conservative with it, which would weaken it for the high-confidence use cases.

jfbastien commented 8 years ago

LLVM's own alignment information is never speculative in nature. At the LLVM level, it's full-octopus UB if the alignment isn't correct.

Indeed, but wasm can't take that for granted. Using that information for speculation will therefore always be correct, unless the developer gives incorrect information.

In general, wasm's alignment hint works well for these kinds of "highly confident" cases. If we change wasm's alignment hint to be used for more speculative use cases, such as humans decorating their own code, then VMs will have to be more conservative with it, which would weaken it for the high-confidence use cases.

That's not what I'm suggesting: I'm saying that the VM can speculate because the developer hinted that the alignment value was correct. These types of high-confidence cases are exactly where we want to avoid throwing away information.

sunfishcode commented 8 years ago

If the producer has high confidence, it can just do the optimization itself. LLVM already does this. Other optimizing compilers commonly do such things. It wouldn't be too difficult to implement this in a relatively lightweight JIT library. Doing this on the producer side reduces time, complexity, and unpredictable performance on the consumer side, so it's worth encouraging.

While speculative optimizations with bailouts and recompilations may be easy to support in some of today's JS-engine-based wasm implementations, there are advantages to building wasm in a way that minimizes its dependence on them to achieve good performance. Greater alignment hints to support greater speculative optimization is one path we might take here, but there are other paths we can take here as well, from a broader perspective.

lukewagner commented 8 years ago

The question is whether encoding page size alignment (a 3-byte LEB) is worth it, versus just specifying natural alignment.

@titzer Since alignment in flags is currently encoded as log2(alignment), 64k-alignment would still fit in 1 byte.

One thing that bugs me about the current design is that the number of reserved bits depends on the load/store width. That seems bug-prone for everyone involved; it means that if we, e.g., later use the bits after the alignment for ordering annotations, they start at a different bit for i32.load than for i64.load. Simply reserving a fixed number of bits for the max alignment we think we might ever need seems simpler. 2 bits only buys 8-byte alignment, which is already too small for i32x4; 3 bits bumps that way up to 128-byte alignment, but that is "only" 2x bigger than AVX-512, so it still seems shortsighted. 4 bits provides 32k-alignment, which seems like truly enough alignment for anyone. So how about just fixing alignment to 4 bits in flags?
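A fixed 4-bit field as proposed here would decode uniformly across widths. A hypothetical layout sketch, not the shipped memarg encoding:

```python
ALIGN_BITS = 4
ALIGN_MASK = (1 << ALIGN_BITS) - 1  # low 4 bits: log2(align), up to 32k

def unpack_flags(flags: int):
    """Split a flags value into (log2_align, remaining bits).
    The remaining bits start at the same position for every load/store
    width, e.g. for future ordering annotations."""
    return flags & ALIGN_MASK, flags >> ALIGN_BITS
```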

(I'm pretty dubious about autovectorization opportunities in wasm, though. For one thing, I think it benefits the whole platform if wasm engines aren't trying to do fancy optimizations, since these add cost (in time and predictability) and, once the race begins, engines can be forced into a situation where they end up hurting the common case for the benefit of benchmarketing; we've seen this with JS. Instead, we should strive to expose the underlying mechanism (viz., SIMD ops) so that wasm generators can be explicit about what they want and not rely on magic. So I wish we could strike autovectorization as an argument for super-natural alignment (or anything else, for that matter).)

ghost commented 8 years ago

@lukewagner Four bits was my suggestion way above too, and could you please also consider assigning the natural alignment the value zero, to help compression.

lukewagner commented 8 years ago

If we can fix immediates in the opcode table (we're working on a prototype impl so we can make a real proposal soon), there won't be any immediate in the common case, so it seems like we won't have to worry about compression effects of the immediate.

rossberg commented 8 years ago

Check now implemented.