Alignment will probably require implementation-defined behavior

titzer commented 9 years ago

It seems that some ARM implementations may ignore the low order bits of unaligned memory accesses and thus round down to the next aligned address. That would mean that every access that the engine cannot prove is properly aligned would need a dynamic check (since these processors won't cause a hardware fault). That may be too slow or too much code.

Would it be reasonable to spec aligned/unaligned accesses thusly?

All accesses require alignment to be specified.
Load/Store[aligned=true] have implementation-defined behavior when the offset is not actually aligned.
Load/Store[aligned=unknown] never have implementation-defined behavior, but may be slow on some architectures when the offset is not actually aligned.

For both kinds of accesses we could specify a sanitizer mode that will trap on Load/Store[aligned=true](actually not aligned) and profile or warn on Load/Store[aligned=unknown](actually not aligned).

The above would allow the engine to omit checks for the [aligned=true] case, accepting whatever the hardware does, but still require it to emit checks for [aligned=unknown] on these processors.
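
To make the hazard concrete, here is a minimal Python simulation (not engine code) of a 4-byte little-endian load on two hypothetical cores: one that honors the unaligned address and one that silently clears the low address bits. The function names and the 8-byte memory image are illustrative assumptions, not anything taken from an ARM manual.

```python
def load32_exact(mem: bytes, addr: int) -> int:
    """Load 4 bytes starting exactly at addr (hardware with real unaligned support)."""
    return int.from_bytes(mem[addr:addr + 4], "little")


def load32_rounding(mem: bytes, addr: int) -> int:
    """Load 4 bytes after clearing the low two address bits (the rounding behavior)."""
    aligned_addr = addr & ~3
    return int.from_bytes(mem[aligned_addr:aligned_addr + 4], "little")


def is_aligned(addr: int, size: int = 4) -> bool:
    """The dynamic check an engine would need when it cannot prove alignment."""
    return addr % size == 0


mem = bytes(range(8))  # illustrative memory: 00 01 02 03 04 05 06 07

# Aligned addresses behave identically on both simulated cores.
aligned_same = load32_exact(mem, 4) == load32_rounding(mem, 4)

# A misaligned address silently yields different data on the rounding core.
unaligned_exact = load32_exact(mem, 1)       # reads bytes 01 02 03 04
unaligned_rounded = load32_rounding(mem, 1)  # reads bytes 00 01 02 03
```

Since the rounding core raises no fault, the only way for an engine to notice the difference is a check like `is_aligned` in front of the access.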

sunfishcode commented 9 years ago

Is there any documentation available on these ARM architectures? I'm interested in learning more.

kripken commented 9 years ago

Me too. Specifically, I wonder if those ARM implementations just silently do the rounding (that would be exactly what JS typed arrays do, ironically :) ? Or do they trap?

kg commented 9 years ago

I seem to remember your proposal roughly being the consensus from prior discussions. Obligatory aligned/unaligned distinction, with unaligned operations Always Working but possibly being slow, and aligned-with-unaligned-address being potentially undefined seems good to me, albeit a little gross.

That distinction is already really important for the polyfill to be remotely usable without breaking applications that do unaligned loads/stores.

The last time I shipped ARM code (on a particular handheld console), it trapped on unaligned accesses in some scenarios (non-32-bit load/store) and was Just Slow in other cases. I think in some cases you can configure the behavior, so it might depend on the OS/host application and not just the hardware.

jfbastien commented 9 years ago

I thought we had agreed to have explicit alignment to a specific byte number (not just true/unknown). The rest is what I recall: if the program lied then implementation-defined behavior occurs.

I wouldn't spec the sanitizers: they can either be done by the developer-side compiler, or by the implementation (maybe behind a flag). I see sanitizers as tools that should "just work", so there's no need to spec them.

The ARM specs aren't accessible publicly, but you can get the PDF for free by registering. This behavior, IIRC, is pre-ARMv7 and in some R and M profile CPUs. Most ARM CPUs sold in consumer devices recently are ARMv7 A profile, or ARMv8, but it would be nice for Web Assembly to work on these other CPUs which are often used in smaller IoT devices (you know we want Web Assembly to be IoT compliant!!!).

titzer commented 9 years ago

Here's a link to a section in the ARM architecture reference manual:

https://books.google.de/books?id=O5G-6WX1xWsC&pg=PT57&lpg=PT57&dq=unaligned+access+on+arm+ignore+lower+bits&source=bl&ots=_d6f1Osah6&sig=RO95auOcu78sxqzgsHY4KpmEwxE&hl=en&sa=X&ei=tOttVYHFC-XuyQOozICgCQ&ved=0CCEQ6AEwAA#v=onepage&q=unaligned%20access%20on%20arm%20ignore%20lower%20bits&f=false

sunfishcode commented 9 years ago

By my reading of the documentation:

ARMv5 and earlier have the alignment-rounding problem.

ARMv6 has multiple configuration modes. The "Legacy" mode behaves like ARMv5. However, many popular ARMv6 implementations, such as Linux on Raspberry Pi, seem to use one of the newer modes that don't have the problem.

In ARMv7 and ARMv8, documentation I have says that the "Legacy" configuration mode is no longer present, and they don't have the problem.

Assuming I didn't miss anything, this appears to come down to a question of the limits of portability (#38). Is ARMv5 or ARMv6-in-legacy-mode worth supporting, at the cost of weakening the spec wrt alignment?

pizlonator commented 9 years ago

Thanks for summarizing this!

ARMv5 is pretty old. I think we'd have to have a super good argument in its favor if we wanted to complicate the spec with it.

-Fil

MikeHolman commented 9 years ago

For us, only ARMv7 THUMB/THUMB2 matter. Of course we aren't in a vacuum so I'm fine making concessions where necessary, but it doesn't sound like ARMv5/legacy mode is important enough to weaken the spec.

titzer commented 9 years ago

Good catch, Dan.

I also just verified that the arm64 specification only requires alignment for ordered and exclusive loads and stores; others are fine to be unaligned. The processor does have a strict alignment checking mode that will trap on unaligned accesses, so it's got that going for it, which is nice.

V8 cares about architectures in roughly this order: X64, ia32, arm, arm64, mips, mips64, ppc.

I'll do some digging into those few at the end and see if there are any issues with alignment that impact this.

titzer commented 9 years ago

I've checked with some MIPS and PPC experts and the result is this: no problem on PPC (should be Intel-fast), and MIPS cores trap to kernel for emulation, but chips are coming that just do it in hardware. So it looks like we're all good if we make the reasonable decision to ignore 10 year old arm cores. I'll double check with the folks at ARM, though.

jfbastien commented 9 years ago

@titzer it's not just older ARM cores: it's low-power / embedded ones too. I've talked to folks running node.js on tiny chips inside lightbulbs. Do we care about this type of user? To what degree?

I'm probably OK saying: we expect fully compliant Web Assembly implementations to have behavior X, but some not-too-compliant implementations could do Y.

I'd rather not ban this behavior outright because I think the usecase matters. It would be nice to have a compliance suite, and implementations can list how they diverge from the spec. When it's "benign" divergences like this I think it's fine.

kripken commented 9 years ago

Would those older ARM cores and tiny low-power embedded chips have larger divergences from "normal" behavior than the polyfill will? Given wasm code that properly annotates the alignment of loads and stores (never says they are aligned when they aren't), both those chips and the polyfill will perform properly, is my understanding correct?

titzer commented 9 years ago

Cores that drop the lower bits from unaligned accesses will require checks inserted by the wasm engine, with emulation code done in user land. All code on those cores pays, even if they always stay aligned.

Cores that trap will go to the kernel and the user program only pays when they actually go unaligned.

sunfishcode commented 9 years ago

@jfbastien Can you be more specific about which models of ARM cores these are? I've checked ARMv7-R and ARMv7-M documentation and both are ok here.

sunfishcode commented 9 years ago

Looks like ARMv6-M is good too.

kripken commented 9 years ago

@titzer: not sure I follow? If a load/store is marked as aligned, then it doesn't need to pay any cost, does it? The VM can emit an aligned access, and if the code lied and it turns out unaligned, it's ok that it drops the lower bits - just like the polyfill does.

And if the load/store is marked as unaligned, then a slow path would be taken, definitely paying a cost, but likewise, around the same as the polyfill pays. And in practice we hope little code would be marked as unaligned, so both polyfill and older/smaller CPUs would be ok.

I feel like the older/smaller CPU case is very similar to the polyfill, overall. Am I missing something?

titzer commented 9 years ago

On Thu, Jun 4, 2015 at 8:51 PM, Alon Zakai notifications@github.com wrote:

@titzer https://github.com/titzer: not sure I follow? If a load/store is marked as aligned, then it doesn't need to pay any cost, does it? The VM can emit an aligned access, and if the code lied and it turns out unaligned, it's ok that it drops the lower bits - just like the polyfill does.

And if the load/store is marked as unaligned, then a slow path would be taken, definitely paying a cost, but likewise, around the same as the polyfill pays. And in practice we hope little code would be marked as unaligned, so both polyfill and older/smaller CPUs would be ok.

That's OK; marking unaligned accesses is a kind of opt-in to may-be-slow.

I feel like the older/smaller CPU case is very similar to the polyfill, overall. Am I missing something?

You are requiring masking for aligned accesses. See first post. I was assuming that aligned accesses would not be masked.

kripken commented 9 years ago

I still don't understand why a claimed-aligned access would require a mask. Why not just emit an access without a mask, on these old/small CPUs? (It might silently drop some bits, but that's what the mask would have done anyhow?)

titzer commented 9 years ago

Because on Intel and processors that support unaligned access properly, it will read/write unaligned memory, and you will get different results than on these older CPUs, or if you had dropped the lower bits in the engine with a mask.

sunfishcode commented 9 years ago

We specifically don't want to be bound by present-day limitations of JS semantics in the long term, so we don't want to get too accustomed to saying "the polyfill did XYZ, so it's ok if other implementations do that too".

kripken commented 9 years ago

@titzer: Yes, but that is exactly as in the polyfill, and we allow it, don't we?

I may have a big misunderstanding here. I was under the impression that if one lied about alignment, claiming it was aligned when it wasn't, then we said that was not fully specified. And the polyfill would then be free to do the "wrong" thing by dropping the lower bits, thus letting it remain fast (otherwise, each load would need to support the case of it being unaligned). In practice, this is fine because the compiler should know what is aligned and what might not be, and we can mark the rare loads which might not be, as unaligned. But 99% of them would be aligned, and fast in the polyfill, and correct in the polyfill.

Did I get that wrong? Are we not saying that claiming alignment but lying leads to implementation-defined behavior?

kripken commented 9 years ago

@sunfishcode: I 100% agree. I wasn't saying that the polyfill does it so it's fine. I am saying that I understood what the polyfill did to be fine because of reason X, and that reason X is valid in itself, and it looks like X applies to old/weak CPUs too. Unless I have misunderstood X all this time?

titzer commented 9 years ago

Just in case it wasn't clear from the start, the goal here was:

1.) If you promise an access is aligned, and it is, you pay nothing, not even a mask.
2.) If you promise an access is aligned and you lied, you get something strange (not nasal demons, but maybe slow, maybe a trap, or maybe you get forcibly aligned).
2b.) In sanitizer mode, if you promise an access is aligned and you lied, you get a trap.
3.) If you said an access is unaligned, it will work on all engines and give you the exact same results. It might be really slow, though.
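
The goals listed above can be read as a small decision procedure. The following Python sketch is one hedged interpretation, not spec text: `lied_behavior` is a hypothetical knob standing in for whatever a given implementation happens to do on a false promise, and `AlignmentTrap`, `MEM`, and `load32` are names invented for illustration.

```python
class AlignmentTrap(Exception):
    """Raised when a load promised alignment that the address does not have."""


MEM = bytes(range(8))  # illustrative memory: 00 01 02 03 04 05 06 07


def load32(mem, addr, promised_aligned, sanitizer=False, lied_behavior="force_align"):
    """Little-endian 32-bit load following goals 1, 2, 2b and 3.

    lied_behavior is a hypothetical stand-in for the implementation-defined
    outcome of a false promise: "force_align", "trap", or "works".
    """
    if lied_behavior not in ("force_align", "trap", "works"):
        raise ValueError(f"unknown lied_behavior: {lied_behavior}")
    lied = promised_aligned and addr % 4 != 0
    if lied:
        # Goal 2b: sanitizer mode always traps on a false promise.
        if sanitizer or lied_behavior == "trap":
            raise AlignmentTrap(f"misaligned address {addr}")
        # Goal 2: one possible strange outcome is silent forced alignment.
        if lied_behavior == "force_align":
            addr &= ~3
    # Goals 1 and 3, and the "works" outcome of goal 2: a plain exact load.
    return int.from_bytes(mem[addr:addr + 4], "little")
```

Only the `lied` branch varies between implementations; an honest program never reaches it, which is the property the list is after.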

kripken commented 9 years ago

@titzer: Yes! :) And is not (2) covered by emitting a load without a mask on those old/small CPUs? You get a forcibly aligned result, which is one of the options you listed.

That's all I've been saying here: aligned loads/stores do not need masks in the polyfill nor on old/small CPUs, assuming those CPUs just ignore the lower bits. So both can be fast on aligned code, and also correct if actually aligned, so they are quite similar in that respect.

(edit: by "masks in the polyfill" i mean "written in the JS code". While of course the VM must emit a mask, because it is JS and has precise semantics. But if the underlying CPU were a weak/old one which itself drops the lower bits and force-aligns, then the VM could actually avoid that, as if the hardware were specialized for typed arrays being aligned ;)

titzer commented 9 years ago

Yes, I realized that on a closer reading of your comments that we're basically in agreement. That does mean that we do have implementation-defined behavior for that [aligned=true]/lied case, which actually I was kind of hoping we could find a way around.

sunfishcode commented 9 years ago

The other side here is that we have yet to actually name a CPU here which we really care about which actually needs implementation-defined behavior. Unless this changes, it'd be great to just stick with our current rules, which don't have the implementation-defined behavior part.

kripken commented 9 years ago

@titzer: Ok, good, now I think we are on the same page.

Given

The rarity of the problem, as supported by both theoretical arguments (unaligned is undefined behavior in C/C++) and practical experience (many codebases ported to typed array semantics, almost no issues; and sanitizer tools fix the few that do),
As @jfbastien says, tiny CPUs exist, not just old ones,
The polyfill will not just matter for a few months but for a very long time.

Then in practice, what difference does it make if [aligned=true]/lied is described as implementation-defined behavior, or not? It seems a philosophical point. Regardless of what we call it, those tiny CPUs and the polyfill will still be able to run wasm codebases just fine, and they will be used to run those codebases.

Is there a practical, concrete benefit to not calling this implementation-defined behavior?

sunfishcode commented 9 years ago

Every bit of implementation-specific behavior we add is an opportunity for applications to behave differently across different implementations. I'm not opposed to all implementation-specific behavior, but it'd be nice if someone could name something more interesting than ARMv5 before we accept it here.

titzer commented 9 years ago

Actually a second round with MIPS folks was less promising. Apparently some devices ship with a mode where unaligned accesses aren't handled by the kernel and they cause a bus error; that hurts and puts the engine back in the situation of emulating the unaligned access in userland. They were also pretty uncomfortable with the performance penalty. Masking might be the best option on those processors. I asked ARM for some clarification about how prevalent ARM chips with the bit-ignoring behavior are; waiting to hear back.

I'm not clear on why we want an alignment annotation if it doesn't make any semantic difference; if [aligned=true]/lied gives exactly the same results as [aligned=false]/not_aligned, then why have it? Is it just to make the latter case fast by always emulating it in userland to avoid kernel traps on crappy hardware?

lukewagner commented 9 years ago

@titzer The difference is that, on architectures with fast unaligned access, [aligned=false] will emit a plain full-size load and on architectures with slow unaligned access, [aligned=false] will emit byte loads with an OR. This is assuming we don't relax what we currently have in the spec (which is no masking, fully deterministic, Just Works). I really hope we don't have to relax, at least not without strong justification (% market saturation of devices). That the polyfill masks (by default, the polyfill can just as well make this an option to always emit byte loads and bitor) is just a willful choice for the polyfill to be incorrect w.r.t the spec for performance reasons.
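
As a rough illustration of the "byte loads with an OR" lowering mentioned above, here is a Python sketch that assembles a 32-bit value from four single-byte loads, next to a plain full-size load for comparison. Little-endian byte order and the function names are assumptions made for the example.

```python
def load32_bytewise(mem: bytes, addr: int) -> int:
    """Assemble a little-endian 32-bit value from four single-byte loads.

    Byte loads are always aligned, so this works at any address.
    """
    return (mem[addr]
            | (mem[addr + 1] << 8)
            | (mem[addr + 2] << 16)
            | (mem[addr + 3] << 24))


def load32_plain(mem: bytes, addr: int) -> int:
    """A single full-size load, as emitted where unaligned access is fast."""
    return int.from_bytes(mem[addr:addr + 4], "little")


mem = bytes([0x10, 0x32, 0x54, 0x76, 0x98, 0xBA, 0xDC, 0xFE])

# Both lowerings agree at every address, aligned or not.
all_agree = all(load32_bytewise(mem, a) == load32_plain(mem, a) for a in range(5))
```

The two lowerings differ only in cost, which is what keeps the semantics deterministic across architectures.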

titzer commented 9 years ago

Sure, that all makes sense, I just mean that if we spec it the way suggested in this issue, then the polyfill is technically correct, as would any masking implementation and all the old CPUs. Otherwise, I see the alignment attribute has marginal value except improving performance on trapping CPUs.

lukewagner commented 9 years ago

@titzer Agreed, just hoping we don't have to :)

sunfishcode commented 9 years ago

I've recently been told that there are some relevant Android devices that have alignment trapping which is imprecise, meaning it may not be feasible for an implementation to fix up misaligned accesses. I don't know the specifics yet, so I don't have a recommendation for what we should do yet, but I do want to outline a possible backup plan.

A backup plan is that we say that it's nondeterministic whether a misaligned access succeeds or traps (and the trap wouldn't be recoverable in this case). That still rules out silent behavior changes (and ARMv5), so we could broaden the set of supportable platforms while still avoiding the worst of the portability risks.

lukewagner commented 9 years ago

Another alternative is, if this is only a small % of users (and I assume old kernel versions that time will obsolete), keep the spec deterministic and any wasm impl on those devices can choose between being incorrect wrt the spec (faulting) or branching on alignment before each load/store (as part of the bounds check) at some runtime cost. I think this is preferable since, whether or not the standard declares trapping as a valid nondeterministic execution, it will be a little-tested path since most devs won't have one of these devices.

sunfishcode commented 9 years ago

Trapping may be an untested path, but it's a very short one :-). And we already have nondeterministic trapping when a program runs out of callstack space ("at any time"), so it's not a great new imposition. I think it'd be nicer to acknowledge this in the spec than issuing exemptions that create a de-facto spec on top of the original one.

lukewagner commented 9 years ago

Well, the option I'd expect engines to instead take (e.g., that I would want FF to take) in this case would be to use dynamic branching to handle misaligned accesses; otherwise, spec-blessed or not, programs are going to fault when running on the platform (especially since we make misaligned accesses Just Work).

titzer commented 9 years ago

This issue needs some more investigation when we get closer to having a full-performance native implementation, but there is also a semantic question here.

At the risk of making the spec a little weaker, it'd be nice to quash a potential proliferation of slightly spec-noncompliant implementations. In particular, how far are we willing to let the polyfill get out of compliance in order to allow it to make use of the slightly different alignment semantics of asm.js?

pizlonator commented 9 years ago

Yes, 32-bit ARM can take a hit if we have misaligned access. They can trap or not based on some CPU flag, and I've found that this flag is often set to trap.

Incidentally, just this past week I had to put out fires because of a C program that behaved differently on different platforms because of alignment. It reminded me how important it is for an allegedly-portable platform to have uniform semantics on such things.

As a user, I think that I'd prefer for my code to run slower on old ARMs than to exhibit different behavior. As an implementer, I wonder how many of the software is-aligned checks would remain if you speculate aligned and deoptimize on misaligned.

-Fil
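
One way to picture "speculate aligned and deoptimize on misaligned" is a per-site state machine. The sketch below is only a toy model under stated assumptions: `LoadSite` is a hypothetical name, the alignment test stands in for however the engine detects a failed speculation (a hardware trap or a software check), and a real JIT could hoist or eliminate that test in ways this model does not capture.

```python
class LoadSite:
    """One load instruction that speculates its addresses are 4-byte aligned."""

    def __init__(self):
        self.deoptimized = False  # set once the speculation fails
        self.fast_hits = 0        # loads served by the speculative fast path

    def load32(self, mem: bytes, addr: int) -> int:
        if not self.deoptimized:
            if addr % 4 == 0:
                # Speculation holds: a single full-size load.
                self.fast_hits += 1
                return int.from_bytes(mem[addr:addr + 4], "little")
            # Speculation failed: this site uses the general path from now on.
            self.deoptimized = True
        # General path: byte loads combined with OR, valid at any address.
        return (mem[addr]
                | (mem[addr + 1] << 8)
                | (mem[addr + 2] << 16)
                | (mem[addr + 3] << 24))


MEM = bytes(range(8))  # illustrative memory: 00 01 02 03 04 05 06 07
```

In this model the result is the same on every path; only the cost changes once a site has seen a misaligned address.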

lukewagner commented 9 years ago

@titzer I'm not sure in which direction you're talking about weakening the spec, but it sounds like allowing nondeterministic auto-alignment which seems like making the spec a lot weaker. As I was arguing above: regardless of what we specify, if only a few platforms exercise a certain nondeterministic path, then they're just as likely to be broken and are better off taking a speed hit to conform to the norm.

In the initial transition phase (when a significant % of the browser market does not have native wasm), the asm.js polyfill will be a tier 1 testing platform so errors will be caught just like they are caught today with pure asm.js (to make it easy to catch bugs, the polyfill can have a throw-on-misaligned option analogous to Emscripten SAFE_HEAP).

As native wasm support becomes ubiquitous, we might see apps coming out that include misaligned accesses (because they only tested native support) but, at that point, we could change the polyfill default to issue byte loads (so it also Just Worked, albeit more slowly). We could also mitigate this issue by having a "Strict mode" browser devtool option that we loudly and widely encouraged everyone to test with.

titzer commented 9 years ago

@luke

It only introduces nondeterminism if the program lies about its alignment annotations. If a program is conservative and always specifies unaligned, it will not experience nondeterminism.

We've had some input from some MIPS partners that they were worried their platform would be severely punished, since the alignment issue is complicated and hardware that handles unaligned access at full speed is still in the pipeline.

I'm not exactly sure what value the alignment attribute has if implementations will generate very similar code for both cases. E.g:

On platforms that have fast hardware-based unaligned support:

load[aligned] x y:
    mov %r0, [%r1 + %r2]

load[unaligned] x y:
    mov %r0, [%r1 + %r2]

On platforms with slow (trap-based) unaligned support:

load[aligned] x y:
    mov %r0, [%r1 + %r2]

load[unaligned] x y:
    if(not_aligned) goto out_of_line_code
    mov %r0, [%r1 + %r2]

On platforms with masking behavior:

load[aligned] x y:
    if(not_aligned) goto out_of_line_code
    mov %r0, [%r1 + %r2]

load[unaligned] x y:
    if(not_aligned) goto out_of_line_code
    mov %r0, [%r1 + %r2]

The only difference I see is on the trap-based platforms, you roll the dice and let the program pay the big performance penalty. But if these platforms are not tested on often, then there will probably be a creep of programs that have unaligned accesses and never noticed because they always ran on the other platforms. In order to avoid that we'd have to specify a debug mode where unaligned accesses trap so that people could flush those bugs out early. But that is essentially specifying nondeterminism, but only in debug mode.
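
The three platform cases listed above can be restated as a lookup, which makes it easy to see where the annotation changes the emitted code. This Python sketch only mirrors the pseudo-assembly from the comment; the platform labels and the `emit_load` and `annotation_matters` helpers are hypothetical names.

```python
from typing import List

PLAIN = ["mov %r0, [%r1 + %r2]"]
CHECKED = ["if(not_aligned) goto out_of_line_code", "mov %r0, [%r1 + %r2]"]


def emit_load(platform: str, annotated_aligned: bool) -> List[str]:
    """Return the pseudo-instructions for one load on the given platform kind."""
    if platform == "fast_unaligned":
        return list(PLAIN)  # hardware handles both cases at full speed
    if platform == "trap_based":
        # Only unaligned-annotated loads get the check; aligned ones risk a trap.
        return list(PLAIN) if annotated_aligned else list(CHECKED)
    if platform == "masking":
        return list(CHECKED)  # deterministic semantics need the check either way
    raise ValueError(f"unknown platform kind: {platform}")


def annotation_matters(platform: str) -> bool:
    """True when aligned and unaligned annotations produce different code."""
    return emit_load(platform, True) != emit_load(platform, False)
```

Under these assumptions, `annotation_matters` is true only for the trap-based case, which matches the observation in the comment.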

lukewagner commented 9 years ago

We've had some input from some MIPS partners that they were worried their platform would be severely punished, since the alignment issue is complicated and hardware that handles unaligned access at full speed is still in the pipeline.

Are there any new constraints we need to be considering here wrt MIPS? I think experience with asm.js shows that misaligned (wrt their alignment annotation by LLVM) accesses are generally rare. Also, I don't see how adding nondeterminism makes the theorized problem go away since if the desired codegen is branching, you can always use branching with the deterministic Just Works semantics.

It only introduces nondeterminism if the program lies about its alignment annotations. If a program is conservative and always specifies unaligned, it will not experience nondeterminism.

But if aligned offers performance advantages (and it would on trap-based platforms), then compilers would be encouraged to always emit aligned (when justified by the language semantics, so the LLVM alignment annotations) and thus most programs would have the nondeterminism.

But if these platforms are not tested on often, then there will probably be a creep of programs that have unaligned accesses and never noticed because they always ran on the other platforms.

This is going to happen regardless, unless we can get most people developing/testing in debug-mode. Given that misaligned accesses are going to happen, having it run, but execute slower, seems like the best option (compared to crashing). Also, while an individual access may be 1000x slower, amortized over a whole program it's likely a much smaller % and so there is a good chance the app will stay usable. Lastly, in the worst case, if this was becoming a problem in practice, trap-based platforms could always mitigate by switching to always-branch (or byte loads).

In order to avoid that we'd have to specify a debug mode where unaligned accesses trap so that people could flush those bugs out early. But that is essentially specifying nondeterminism, but only in debug mode.

The debug mode I was imagining would be a pure devtools option, not a mode in the wasm spec, and would deterministically fault on all misaligned accesses (and so have a small perf hit on x86). If it's a devtool, I don't think it counts as nondeterminism any more than a debugger changing values is nondeterministic.

jfbastien commented 9 years ago

The worry on these platforms is that regular accesses either need to be split up into byte accesses and then merged, or signal handling must be used. This isn't a "pay for what you use" approach to performance: you may have no unaligned accesses and performance will suffer, or you'll need to use a signal handler which folks have said they don't want to mandate. See the Linux MIPS docs for details.
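For readers unfamiliar with the "split into byte accesses and then merged" approach, here is a minimal sketch in C. It assumes a little-endian linear memory modeled as a byte array; the helper names `load_u32_bytewise` and `store_u32_bytewise` are hypothetical and not taken from any engine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load a 32-bit little-endian value from any address
   by issuing four byte loads and merging them. Valid at every alignment. */
uint32_t load_u32_bytewise(const uint8_t *mem, size_t addr) {
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* Hypothetical helper: the matching store, split into four byte stores. */
void store_u32_bytewise(uint8_t *mem, size_t addr, uint32_t value) {
    mem[addr]     = (uint8_t)(value & 0xFFu);
    mem[addr + 1] = (uint8_t)((value >> 8) & 0xFFu);
    mem[addr + 2] = (uint8_t)((value >> 16) & 0xFFu);
    mem[addr + 3] = (uint8_t)((value >> 24) & 0xFFu);
}
```

Every access pays for four memory operations plus shifts, whether or not the address was actually misaligned, which is the "not pay for what you use" cost described above.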

lukewagner commented 9 years ago

@jfbastien Yes, but what is the nondeterminism buying us in those cases? If you have to branch on misaligned access anyway then you can just as well implement Just Works as something else nondeterministic. The only case I can see nondeterminism buying something is for auto-aligning platforms which would not otherwise have to branch. Is this the MIPS use case?
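The "branch on misaligned access" codegen mentioned here could look roughly like the following C sketch. It assumes a little-endian host and a byte-array linear memory; `memcpy` stands in for the single aligned hardware load, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Slow path: four byte loads merged, valid at any alignment. */
uint32_t load_u32_slow(const uint8_t *mem, size_t addr) {
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* "Just Works" load with an explicit alignment branch.
   Assumption: little-endian host, so the fast path and the slow path
   agree on byte order. */
uint32_t load_u32_just_works(const uint8_t *mem, size_t addr) {
    if ((addr & 3u) == 0) {
        uint32_t value;
        memcpy(&value, mem + addr, sizeof value); /* stands in for one aligned load */
        return value;
    }
    return load_u32_slow(mem, addr); /* misaligned: take the slow path */
}
```

Once the branch exists, the slow path can return the correct value at no extra cost to the aligned case, which is why nondeterminism buys nothing on platforms that have to branch anyway.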

lukewagner commented 9 years ago

... and that is just from the performance perspective. From the perspective of "I want apps that run on other platforms correctly to also run on my auto-aligning platform correctly", then you don't want to be the one oddball platform that auto-aligns; of course apps are going to randomly break for you. That's why I was saying above (and iiuc @pizlonator was also saying) that, even if nondeterminism was a choice, I'd still want to implement Just Works semantics just to minimize bustage.

pizlonator commented 9 years ago

Do we have data on what the penalty for misaligned-accesses-do-weird-things platforms will be, if we require misaligned accesses to just work, but then also roll up our sleeves and actually optimize that case? I’ve been pondering this a bit. If you have profiling that tells you what the low bits of a pointer tend to look like, then you can emit optimized code that is biased for either aligned or misaligned, and you could even speculate that the pointer was already aligned which allows you to blow away repeated alignment checks on that pointer - and probably alignment checks on most pointers derived from that one, if the derivatives are just “ptr + C” where C is a multiple of the appropriate word size.
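The idea of checking a pointer once and skipping the checks on "ptr + C" derivatives can be illustrated with a small C sketch. It is not a description of any engine's optimizer: the function name `sum_u32` is hypothetical, the host is assumed little-endian, and `memcpy` stands in for an aligned hardware load.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum `count` consecutive 32-bit values starting at `base`.
   The alignment of `base` is tested once. Every derived address
   base + 4*i has the same low two bits, so neither loop needs a
   per-access alignment check. */
uint32_t sum_u32(const uint8_t *mem, size_t base, size_t count) {
    uint32_t total = 0;
    if ((base & 3u) == 0) {
        /* Aligned version: one plain load per element. */
        for (size_t i = 0; i < count; i++) {
            uint32_t value;
            memcpy(&value, mem + base + 4 * i, sizeof value);
            total += value;
        }
    } else {
        /* Misaligned version: byte loads merged, little-endian. */
        for (size_t i = 0; i < count; i++) {
            size_t a = base + 4 * i;
            total += (uint32_t)mem[a]
                   | ((uint32_t)mem[a + 1] << 8)
                   | ((uint32_t)mem[a + 2] << 16)
                   | ((uint32_t)mem[a + 3] << 24);
        }
    }
    return total;
}
```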

Since we probably do not have such data, it seems we have the following to choose from, and the following mitigations in a subsequent version if the performance isn’t good enough:

1) MVP only has access modes that Just Work when misaligned, old ARM and MIPS be damned. Future versions introduce new access modes, which allow for better performance on old ARM and MIPS.
2) MVP only has access modes that Trap when misaligned, x86 and ARM64 be damned. Future versions introduce new access modes, which allow for better performance on x86.
3) MVP only has access modes that are undef when misaligned. Future versions nail down the undef to mean either “Just Work” or “Trap”, depending on our empirical findings.

I prefer (1) because it’s the most forward-looking. I like (2) more than (3) because undef has a high likelihood of causing confusion for developers.

-Filip

sunfishcode commented 9 years ago

C/C++ developers can also catch misaligned accesses by using UBSan (aka -fsanitize=undefined) (clang, GCC).
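As an illustration of what the sanitizer catches, here is a small C sketch with hypothetical function names. The first function is undefined behavior in C when `p` is not suitably aligned; the alignment check that is part of `-fsanitize=undefined` is meant to report such a load at run time. The second function is the usual well-defined alternative.

```c
#include <stdint.h>
#include <string.h>

/* Undefined behavior when p is not 4-byte aligned (and also a strict
   aliasing hazard). This is the kind of access UBSan's alignment check
   is designed to flag. */
int32_t read_i32_cast(const unsigned char *p) {
    return *(const int32_t *)p;
}

/* Well-defined at any alignment: copy the bytes into a properly
   aligned local. Compilers typically lower this to a single load on
   targets that allow unaligned access. */
int32_t read_i32_memcpy(const unsigned char *p) {
    int32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```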

sunfishcode commented 9 years ago

I agree with what's said above; nondeterminism in anything other than trapping-or-not doesn't help much because it just converts applications that were slow on said architectures to applications that behave wrong on the same architectures.

I still believe "it's nondeterministic whether misaligned accesses trap" (misaligned means dynamic alignment is less than static alignment) is worth considering if we can't do "everything always just works". Implementations on MIPS/etc. might then choose to have two modes, "fast" (traps) and "slow" (branches). "fast" could be the default, and when a program traps (which should be rare), the implementation could, for example, automatically restart the program, blacklisting it to "slow" mode thereafter. Blessing this in the spec means that spec conformance can remain something which is done by default. And this approach would mean that there's no mandate to catch and handle signals, and it would permit "pay for what you use", addressing two of @jfbastien's concerns above.

ARMv5 would just have to do "slow" mode, but there's a fair amount of agreement here that ARMv5 is old and not worth complicating the spec for.
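The "fast" then "slow" strategy can be sketched as a small C simulation. Everything here is hypothetical: the trap is modeled as a `false` return instead of a hardware fault, the "program" is just a list of load addresses, and the sketch ignores any side effects a real program might have performed before trapping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MODE_FAST, MODE_SLOW } access_mode;

/* One 32-bit little-endian load. In MODE_FAST a misaligned address
   reports a trap (returns false); a real engine would rely on the
   hardware fault rather than this explicit test. MODE_SLOW accepts
   any address. */
bool load_u32(const uint8_t *mem, size_t addr, access_mode mode, uint32_t *out) {
    if (mode == MODE_FAST && (addr & 3u) != 0) {
        return false;
    }
    *out = (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
    return true;
}

/* Stand-in "program": sums the values loaded from each address.
   Returns false if any load trapped. */
bool run_program(const uint8_t *mem, const size_t *addrs, size_t n,
                 access_mode mode, uint32_t *result) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t value;
        if (!load_u32(mem, addrs[i], mode, &value)) {
            return false;
        }
        total += value;
    }
    *result = total;
    return true;
}

/* Try fast mode first; on a trap, restart from the beginning in slow
   mode and record which mode finished the run. */
uint32_t run_with_fallback(const uint8_t *mem, const size_t *addrs, size_t n,
                           access_mode *final_mode) {
    uint32_t result = 0;
    *final_mode = MODE_FAST;
    if (!run_program(mem, addrs, n, MODE_FAST, &result)) {
        *final_mode = MODE_SLOW;
        run_program(mem, addrs, n, MODE_SLOW, &result);
    }
    return result;
}
```

A program with no misaligned accesses never leaves fast mode, which is the "pay for what you use" property; only programs that actually trap are demoted.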

titzer commented 9 years ago

The other important implementation that does masking (i.e. forcible alignment) is the polyfill to asm.js. If we go with "always works", then the polyfill is going to be incorrect for misaligned accesses. How strongly do we value the correctness of the polyfill? Or conversely, how specially do we treat the polyfill in comparison to any other implementation? When a spec comes, will we need to add special exceptions for it, or will it remain non-compliant with the spec?

sunfishcode commented 9 years ago

There is a plan for the polyfill. It's a little awkward, but it's an attempt at a practical strategy to break with JS semantics in certain key areas.

If an implementor is thinking "the polyfill masks addresses, so why shouldn't I do it too?", we'll remind them that any time the polyfill's alignment masking actually affects anything, then the program doesn't work right under the polyfill. "Program doesn't work right" isn't something that we anticipate implementors should need to emulate [0].

[0] And we aren't worried about programs coming to depend on the polyfill semantics either, because we already know that popular native wasm implementations won't be masking.
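To make concrete why alignment masking means "the program doesn't work right", here is a small C sketch comparing a masked load with a Just Works load on the same misaligned address. The function names are hypothetical; the masked version mirrors what an asm.js-style `HEAP32[addr >> 2]` access does, and memory is modeled as a little-endian byte array.

```c
#include <stddef.h>
#include <stdint.h>

/* Just Works semantics: read the four bytes that start at addr. */
uint32_t load_u32_exact(const uint8_t *mem, size_t addr) {
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* Auto-aligning semantics: the low two address bits are ignored, so a
   misaligned address silently reads the enclosing aligned word. */
uint32_t load_u32_masked(const uint8_t *mem, size_t addr) {
    return load_u32_exact(mem, addr & ~(size_t)3);
}
```

With the bytes `00 11 22 33 44 55 66 77` in memory, a load at address 1 yields `0x44332211` under Just Works semantics and `0x33221100` under masking. The two agree only when the address is already aligned.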

paul99 commented 9 years ago

@titzer asked me to comment here; I work at MIPS/Imgtec on V8.

As discussed above, existing MIPS cores trap on unaligned accesses. Any remotely modern kernel will fix up the unaligned load/store (same result as x86). It just works, but these accesses are slow.

Newer cores (in development) will support unaligned accesses in hardware.

Of course, code that claims [aligned=true] but lies could tank performance.

Detection and deoptimization to safe accesses would be trivial with a signal handler (though we have avoided those due to concerns with sandboxing, etc.). There are pure software methods discussed by others above.

So MIPS does not introduce nondeterminism, and the performance impact of 'Just Work when misaligned' can be mitigated over time.

The debug-mode dev tool support would be excellent.
