WebAssembly / design

Partially out-of-bound writes #1490

Open sparker-arm opened 8 months ago

sparker-arm commented 8 months ago

The semantics of partially OOB writes were clarified in #1025. This has proved problematic for runtimes running on Arm, where there is no such guarantee in the architecture. I think all browsers now implement the spec properly, but I'm not sure if any of the wasm-specific non-browser runtimes do.

I can see that having deterministic memory contents is a nice property, but I'm not sure what the real advantages are; could someone enlighten me?

And do you think there's any room to relax the spec in this regard, possibly even through an extension like relaxed-simd?

conrad-watt commented 8 months ago

As one point, deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours (even on a single underlying system). Presumably this would be relevant here when comparing engines implementing load/store with explicit bounds check vs guard pages.

How do the browsers using guard pages currently achieve conformant behaviour on Arm?

sparker-arm commented 8 months ago

I'm not sure for Firefox: here's a lengthy thread about the issues they've found, and it's not just Arm-specific.

For V8, though I don't think I've actually enabled it yet for Chromium, we can rely on trap handlers for loads but have to perform manual bounds checks for stores.

deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours

Sure, but I'm still struggling to understand in what situations it is useful; I don't understand how it helps once we trap. Can it help in some debug situations?

conrad-watt commented 8 months ago

Sure, but I'm still struggling to understand in what situations it is useful; I don't understand how it helps once we trap. Can it help in some debug situations?

Sorry, I rushed to write the previous comment and didn't work through the reasoning properly. On the Web, it's possible to catch Wasm traps in JS and continue execution (and observe the state of the memory). We have a lot of previous experience with tiny edge cases of the semantics being relied on by some Web program, so there would be some inertia against introducing further non-determinism here. That being said, if browsers currently aren't implementing this conformantly (or even can't, as the Firefox issue above suggests), that would be an argument that we could relax things, since existing Web programs couldn't be relying on conformant behaviour.
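For concreteness, here's a sketch of the same observation outside the browser, using the Wasmtime embedding API (the module and its `poke` export are made up for illustration; on the Web the equivalent is catching the `WebAssembly.RuntimeError` in JS and then reading the exported memory):

```rust
use wasmtime::*;

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // Hypothetical module: one page of memory and a function that performs
    // an i64 store at a caller-chosen address.
    let module = Module::new(&engine, r#"
        (module
          (memory (export "mem") 1)
          (func (export "poke") (param i32 i64)
            (i64.store (local.get 0) (local.get 1))))
    "#)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let poke = instance.get_typed_func::<(i32, i64), ()>(&mut store, "poke")?;
    let mem = instance.get_memory(&mut store, "mem").unwrap();

    // An 8-byte store starting 4 bytes before the end of the 64 KiB memory:
    // partially out of bounds, so it must trap.
    assert!(poke.call(&mut store, (65532, 0x0102030405060708)).is_err());

    // Execution continues after the trap and the memory is still readable;
    // the spec question is what these four in-bounds bytes may now contain.
    println!("{:?}", &mem.data(&store)[65532..65536]);
    Ok(())
}
```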

sparker-arm commented 8 months ago

Okay, thanks.

From bug trackers, though, it seems it only became 'a problem' once the necessary tests were added to the wasm spec test suite. And I'd just like to reiterate that most (all) wasm runtimes only pass this test on Arm hardware because of microarchitecture details (luck!), so it's not just browsers where programs can't rely on conformant behaviour... it's basically everything. It seems it is difficult to implement two policies, and it also means giving up ~10-20% execution performance when using explicit checks.

conrad-watt commented 8 months ago

It seems it is difficult to implement two policies, and it also means giving up ~10-20% execution performance when using explicit checks.

Yeah, if being truly conformant requires engines to totally give up on the guard page strategy, I would (personally) count that as "can't implement conformantly" and (personally) argue that the semantics should be relaxed.

@eqrion am I understanding correctly that today in "release" Firefox, it's possible to craft a program that (on some platform) can observably write bytes to memory in the case of a partially OOB write?

rossberg commented 8 months ago

Even if we were to relax this, we would need to fix a behaviour in deterministic mode, which some platforms have a hard reliance on. Is the only solution for them to not use guard pages, or not use Arm?

sparker-arm commented 8 months ago

I don't have an answer for how else to get around it... Maybe the penalty of explicit checks is worth the absolute determinism? Having to exclude an underlying hardware implementation doesn't sound very wasmy. Would you mind sending me some link(s) to the aforementioned platforms? I would really like to know about some concrete use-cases.

EDIT: On a single aarch64 Linux system, I found the execution time overhead of explicit checks, for stores only, to be ~5% across an arbitrary set of benchmarks.
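(For clarity, by "explicit checks" I mean a per-store lowering along these lines; this is only a sketch, with `base`, `len` and `trap` standing in for whatever the engine's codegen actually uses.)

```rust
/// Sketch of an explicitly bounds-checked 8-byte store. The check runs before
/// anything is written, so a partially out-of-bounds access modifies nothing.
unsafe fn store_i64_checked(base: *mut u8, len: usize, addr: usize, value: i64) {
    if addr.checked_add(8).map_or(true, |end| end > len) {
        trap(); // out of bounds: take the trap path, memory untouched
    }
    (base.add(addr) as *mut i64).write_unaligned(value.to_le()); // wasm is little-endian
}

fn trap() -> ! {
    panic!("wasm trap: out-of-bounds memory access")
}
```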

rossberg commented 8 months ago

One particularly sensitive class of examples is blockchain platforms running Wasm, because they rely on consensus between replicated distributed computations, possibly across heterogeneous hardware. That is only meaningful if all executions are guaranteed to be fully deterministic, including in cases where traps occur. There are a bunch of these platforms, such as Dfinity, Parity, etc.

sparker-arm commented 8 months ago

Thank you! I feared blockchain might be the answer :)

So, sorry to get a little side-tracked, but I think I'm getting lost among the jargon... I see you are/were affiliated with Dfinity, so maybe you know for them: does their platform run on aarch64? And do you know whether most of these platforms are using wasmtime/cranelift? (IIRC, cranelift does not produce conforming code.)

I can see that Parity have their own interpreter; do you know if rolling-your-own interpreters/compilers is a popular choice?

tschneidereit commented 8 months ago

Does the blockchain use case require consensus on failed executions? I'd have expected that a result is thrown out if it ends in a trap, and that that is effectively the way to get to deterministic outcomes for that case.

titzer commented 8 months ago

@sparker-arm Can you clarify which arm architectures might be affected? Does this apply to modern processors and arm64, or is this limited to certain classes of arm processors?

grubbymits commented 8 months ago

Can you clarify which arm architectures might be affected

To be pedantic, no Arm architecture provides the guaranteed semantics. It is implementation-defined. The problems generally do not arise during testing because it's either running in an emulator or on a microarchitecture that does provide the semantics. For Arm-designed cores, it can loosely be described as: the big cores match wasm but the small cores do not. I don't have an exhaustive list. Apple silicon doesn't suffer from this inconsistency.

rossberg commented 8 months ago

@sparker-arm, afaik, Dfinity currently only runs on x64 nodes, but it is not intended to stay like that. For their use case at least, an interpreter would be too slow, because it's a 3rd gen blockchain that is meant to support full-featured on-chain applications.

@tschneidereit, consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site (for distributed message calls), and so are an integral and well-defined part of the execution model.

That said, I'm sure there are other use cases for deterministic execution that do not care about the failure case. Reproducible execution/builds could fall into that category (not sure).

fitzgen commented 8 months ago

it also means giving up ~10-20% execution performance when using explicit checks.

FWIW, the overheads can be much greater than that, depending on the benchmark. I've measured ~1.55x slowdowns in both Wasmtime and SpiderMonkey on real-world programs (spidermonkey.wasm running a JS markdown renderer).

cfallin commented 8 months ago

@sparker-arm Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently? (Would big.LITTLE throw a wrench into that perhaps?)

I'm also curious about other architectures: does anyone know offhand if e.g. RISC-V or other ISAs guarantee anything about partially-trapping writes?

cfallin commented 8 months ago

@sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case?
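Roughly this, in other words (a sketch only; whether the plain probe load is architecturally sufficient on its own, or needs some ordering guarantee, is exactly the question):

```rust
/// Sketch of the load-before-store probe: touch the last byte of the store's
/// range with a load first, so a partially out-of-bounds access faults on the
/// read (which has no side effects) rather than on the write.
unsafe fn store_i64_probed(base: *mut u8, addr: usize, value: i64) {
    // If this byte lies in the guard region, the load faults and the signal
    // handler turns it into a Wasm trap before the store below executes.
    let _probe: u8 = core::ptr::read_volatile(base.add(addr + 7));
    // Open question: can the store become architecturally visible even though
    // the probe above faults?
    (base.add(addr) as *mut i64).write_unaligned(value);
}
```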

sparker-arm commented 8 months ago

Would big.LITTLE throw a wrench into that perhaps?

Yup! As my alter ego 'grubbymits' accidentally mentioned, I believe almost all configurations of big and small cores would be incompatible. I have a feeling there may be a single big Armv8 core that behaves differently to the rest.

sparker-arm commented 8 months ago

@sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case?

I don't know, I need to speak to one of our memory model experts. I suspect that it may not work without atomics.

cfallin commented 8 months ago

Thinking about the load-before-store idea a bit more: a page fault is a precise exception, in the sense that all side-effects of earlier instructions should occur, and none of later instructions should, correct? (The concern here is just about precision within the sub-operations of a single instruction's write.) In other words, if I have a load that faults on page A, and then later in the program, a store to address B, the architecture should unambiguously state that the store to B does not occur (or may occur speculatively, but we're only concerned with final architectural state here)? Otherwise, you couldn't restart after a page fault, because A and B might alias.

sparker-arm commented 8 months ago

Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change...

From the crypto use-case, it seems they all provide an SDK, so they can fully control the target wasm feature set and disable anything that could introduce non-determinism. If this is the main case for fully deterministic memory, I wonder if we could have this as an opt-in feature?

rossberg commented 8 months ago

@sparker-arm, deterministic mode (introduced with the Relaxed SIMD proposal, which is similarly messy) is what I was getting at. But we'll still need to specify some concrete behaviour for that. I am trying to understand if this mode would even be implementable on Arm without severe performance hits, and what the cheapest semantics would be for that.

sparker-arm commented 8 months ago

Well, is there any way that we can implement alignment guarantees in WebAssembly? (I'm not sure I understand the utility of alignment hints.)

akirilov-arm commented 8 months ago

@cfallin

Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently?

Sadly, the answer to all your questions is no. And yes, architecturally speaking, there is no requirement that a particular implementation/microarchitecture/CPU behaves consistently, even if we ignore the existence of technologies such as big.LITTLE (i.e. even on a single-core system).

cfallin commented 8 months ago

Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change...

It'd still be useful from my perspective at least to have a definitive answer on the load-before-store idea: if the spec doesn't change, then this is the only reasonable option I see to keep competitive Wasm performance on aarch64, unless I'm missing some other option!

akirilov-arm commented 8 months ago

Also, to be precise, on AArch64 the behaviour is not implementation-defined in the sense that an implementation is supposed to choose a specific behaviour and stick to it: architecturally, the contents of the in-bounds portion of the memory affected by a store become unknown (so in theory they could become zero, for instance, even if the value to be stored is non-zero).

sparker-arm commented 8 months ago

consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site

@rossberg I'm still trying to understand the mechanics of consensus: do the linear memories of the instances have to be compared?

rossberg commented 8 months ago

@sparker-arm, sort of. In the abstract, each node computes some form of cryptographic hash of the state changes it has performed, including the memory contents. Virtual memory techniques to identify dirtied pages at the end of execution make that practical and fairly efficient even for large memories, because the number of touched pages is bounded by the gas limit.
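Purely as an illustration of the shape of that computation (not any platform's actual scheme; `dirty_pages` stands in for whatever write-tracking the engine or OS provides):

```rust
use sha2::{Digest, Sha256};

const WASM_PAGE: usize = 65536;

/// Illustrative only: digest the pages an execution dirtied so that replicas
/// can compare the state changes they performed.
fn state_digest(memory: &[u8], dirty_pages: &[usize]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    for &page in dirty_pages {
        let start = page * WASM_PAGE;
        let end = (start + WASM_PAGE).min(memory.len());
        hasher.update((page as u64).to_le_bytes()); // bind contents to the page index
        hasher.update(&memory[start..end]);
    }
    hasher.finalize().into()
}
```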

That said, we shouldn't narrow the consideration to the obscure needs of blockchains. Portable, deterministic execution and the absence of undefined behaviour (even where non-determinism is allowed) have been among the original design goals and a big selling point of Wasm. We somehow need to maintain that. Demoting determinism to an opt-in mode is already a compromise that not everybody was happy with.

sparker-arm commented 8 months ago

Okay, so I'm now trying to reason about what is different about this case of non-determinism compared to the existing standard and incoming extensions. Given that wasm can produce non-deterministic NaN values, this case doesn't feel wildly different to me.

I see that NaN values are normalised with instrumentation in the Dfinity SDK/runtime? Could partial writes be handled in the same way? Maybe even by breaking up stores into byte stores? It just seems there's existing precedent for working around the limitations for runtimes that need it.

rossberg commented 8 months ago

It's in fact Wasmtime that does the NaN normalisation; it has a configuration option for that. That roughly corresponds to the idea of having a deterministic mode, which this engine (and possibly others) supports.
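(For reference, a minimal sketch of turning that option on via the embedding API, assuming the current name of the Wasmtime configuration knob:)

```rust
use wasmtime::{Config, Engine};

fn deterministic_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Canonicalise NaN payloads produced by float operations so results match
    // across hosts, at some runtime cost.
    config.cranelift_nan_canonicalization(true);
    Engine::new(&config)
}
```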

It would be okay-ish to only have deterministic behaviour of OOB writes in deterministic mode. But even then we need to resolve two questions:

  1. In "performance" mode, the behaviour must still be well-defined. In the worst case, this definition could amount to complete non-determinism (could write any value whatsoever), but preferably, we should narrow it down more than that.

  2. For deterministic mode we must specify a fixed behaviour. This behaviour should be implementable without excessive overhead and without burdening engines significantly. Breaking up all stores into individual bytes does not qualify, AFAIAC.

conrad-watt commented 8 months ago

In "performance" mode, the behaviour must still be well-defined. In the worst case, this definition could amount to complete non-determinism (could write any value whatsoever), but preferably, we should narrow it down more than that.

IMO we should solve this the same way we planned to solve bulk memory + mem.protect a while ago. Independently nondeterministically choose whether or not each in-bound byte is written. I don't think we should write a spec that admits the Aarch64 "thin air" behaviour.

EDIT: actually, are there any implications here for the "last" write of a bulk memory op?

EDIT2: first instinct: no, due to the eager bounds check on bulk ops
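As a rough executable model of the per-byte proposal above (the `choose` closure stands in for the nondeterministic choice; the point is only to pin down which outcomes the spec would allow):

```rust
/// Model of the proposed "performance mode" semantics for a partially
/// out-of-bounds store: each in-bounds byte is independently and
/// nondeterministically either written or left unchanged; out-of-bounds bytes
/// are never written, and no "thin air" values appear. The access still traps
/// after this (partial) effect.
fn partial_store_model(memory: &mut [u8], addr: usize, bytes: &[u8], mut choose: impl FnMut() -> bool) {
    for (i, &b) in bytes.iter().enumerate() {
        if let Some(slot) = memory.get_mut(addr + i) {
            if choose() {
                *slot = b; // this byte may land, or may not
            }
        }
    }
}
```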

pepyakin commented 8 months ago

There is bunch of these platforms, such as Dfinity, Parity, etc.

I worked on that part of Substrate and Polkadot (which are developed by Parity) and thought I would chime in.

Does the blockchain use case require consensus on failed executions? I'd have expected that a result is thrown out if it ends in a trap, and that that is effectively the way to get to deterministic outcomes for that case.

Exactly, after a trap we consider the instance to be ruined and we do not attempt to perform any execution with such an instance. The typical toolchains do not (and perhaps cannot) leave the memory in a coherent state after a trap (the simplest example here would be the shadow stack pointer leaking after a trap). We do not read the memory contents of the ruined instance either.

I can see that Parity have their own [interpreter](https://github.com/paritytech/wasmi); do you know if rolling-your-own interpreters/compilers is a popular choice?

wasmi was the first Rust interpreter at the time (2017). Substrate and Polkadot switched to wasmtime as soon as (or even a bit before) it had matured, with wasmi being used for some niche use-cases.

sparker-arm commented 8 months ago

Just noting that this is also an issue for RISC-V, as reported here.

sparker-arm commented 8 months ago

For deterministic mode we must specify a fixed behaviour. This behaviour should be implementable without excessive overhead and without burdening engines significantly.

Then, could we have the trap handler rewrite the maybe-partially-written memory to a specified value?

EDIT: If I understand the wording of the spec, once a trap instruction executes, the whole program is reduced to a trap. So, couldn't we just (re)set the linear memory to a predefined state? I assume it has a defined initial state?

cfallin commented 8 months ago

Then, could we have the trap handler rewrite the maybe-partially-written memory to a specified value?

That would require saving the old value off somewhere preemptively, thus adding a bit more overhead than we'd like. Actually it's strictly worse than the load-before-store idea, AFAICT (since both require a load, but the trap-handler-fixup technique also needs a side-undo-buffer).

sparker-arm commented 8 months ago

Sorry, I meant a wasm spec predefined value, not just fixing up the memory with the previous value.

rossberg commented 8 months ago

Wouldn't that amount to out-of-thin-air behaviour?

sparker-arm commented 8 months ago

Wouldn't that amount to out-of-thin-air behaviour?

'A trap results if any of the accessed memory bytes lies outside the address range implied by the memory’s current size. Any bytes inside the address range are set to X'

Would that be defined as thin-air?

cfallin commented 8 months ago

I suppose @sparker-arm means something like all-zeroes for the inbounds part? It's not super-elegant from a pure semantics point of view but I actually like one practical aspect of it: all of the overhead is on the trapping path, and it's easy to have the same deterministic behavior on all architectures (including those that do trap without any partial write).

AFAICT we don't have any other option on the table if we (i) need to match results across machines, (ii) want to provide some memory state after trap (modulo the masked/overwritten part), (iii) have machines that actually do the partial writes, and (iv) want to retain high performance without explicit bounds checks? Otherwise:

  * disallow any access to the state of memory after a trap; but that's far too useful to give up IMHO

  * I guess it's mostly a theoretical ISA spec issue, unless there are small cores that do actually do this?

So maybe it's the best option?
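Sketching the trapping path under that option (illustrative only; in a real engine this would happen in the signal/trap handler, using whatever metadata it keeps about the faulting access):

```rust
/// Illustrative only: on a partially out-of-bounds store, clear the in-bounds
/// prefix to the spec-defined value (zero here) before raising the trap, so
/// every implementation exposes the same memory contents afterwards.
fn zero_inbounds_prefix(memory: &mut [u8], addr: usize, access_size: usize) {
    let end = memory.len().min(addr.saturating_add(access_size));
    if addr < end {
        memory[addr..end].fill(0);
    }
}
```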

sparker-arm commented 8 months ago

I guess it's mostly a theoretical ISA spec issue, unless there are small cores that do actually do this?

All little cores, AFAIK. The big cores from our Sophia design team would also break wasm, which I believe are the Cortex-A73 and A75.

disallow any access to the state of memory after a trap; but that's far too useful to give up IMHO

Just wondering whether you have a use for it outside of debugging? I would definitely like clarification of what the state of the store should be when a program is reduced to a single trap.

rossberg commented 8 months ago

@sparker-arm, not sure. It would be a value that was neither in the memory before the write nor written to it. Although thin-air semantics typically means an arbitrary value, while this would be a fixed one.

@cfallin, (i) does not answer the question of what to do in deterministic mode, and (ii) would be a severe breaking change -- it is completely legal today to reenter Wasm and access memory after traps, and it appears highly unlikely that no existing web page or application does that in the case of OOB. It would also have quite non-trivial consequences for the relaxed memory model, similar to "shrinking" a live memory.

sparker-arm commented 8 months ago

it is completely legal today to reenter Wasm and access memory after traps, and it appears highly unlikely that no existing web page or application does that in the case of OOB.

But is the state of memory, after a trap, actually well-defined by the spec? This is what I've found:

A trap results if any of the accessed memory bytes lies outside the address range implied by the memory’s current size.

The execution of an instruction may trap, in which case the entire computation is aborted and no further modifications to the store are performed by it. (Other computations can still be initiated afterwards.)

The trap instruction represents the occurrence of a trap. Traps are bubbled up through nested instruction sequences, ultimately reducing the entire program to a single trap instruction, signalling abrupt termination.

This doesn't sound like the memory can be inspected to figure out what the program did before the trap executed.

rossberg commented 8 months ago

But is the state of memory, after a trap, actually well-defined by the spec?

Yes, absolutely. The (non-normative) bits you quote just give a general intuition that a trap aborts the current computations, including any effect the current instruction might otherwise have.

The normative specification for executing stores is here. In particular, the bounds check is performed and results in a trap (step 12a) before any modification of the state is performed (step 15).

(This spec becomes much more complicated with the relaxed memory model in Wasm 3+, but the result is the same as far as single-threaded memory is concerned. For shared memory, it also specifies that no mutation can be observed by concurrent reads, and that OOB checks are implicitly atomic.)
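Paraphrasing those steps as pseudocode (illustrative, not the normative text):

```rust
#[derive(Debug)]
struct Trap;

/// Paraphrase of the normative store rule: the bounds check (step 12a) occurs
/// before any byte of the memory is modified (step 15).
fn exec_store(memory: &mut [u8], addr: usize, bytes: &[u8]) -> Result<(), Trap> {
    let end = addr.checked_add(bytes.len()).ok_or(Trap)?;
    if end > memory.len() {
        return Err(Trap); // trap: the memory has not been touched
    }
    memory[addr..end].copy_from_slice(bytes); // only now are the bytes replaced
    Ok(())
}
```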

sparker-arm commented 8 months ago

Thanks, but it's the notion of 'ultimately reducing the entire program to a single trap instruction' which I'm struggling with. Surely a program which is reduced to a single trap shouldn't have any effect on memory at all?

rossberg commented 8 months ago

Ah, yes, I can see how that is confusing. "Program" is meant here in a more formalistic sense, being the current sequence of instructions executed (since entering Wasm), which is "reduced" to a result per a small-step reduction semantics. It doesn't mean that any actual code disappears; that still exists in the function store of the abstract machine and can be reinvoked.

tlively commented 8 months ago

Allowing partially OOB writes to nondeterministically write any value to the in-bounds region, then tightening that up to say they always write zeroes to the in-bounds region under the deterministic profile sounds ok to me.

As an alternative for the deterministic profile, would it be feasible to say that the in-bounds portion of the write will succeed? On the trapping path, the write would essentially have to be repeated byte-by-byte for the in-bounds region, but at least there's no need to know what the previous value in memory was under that scheme.

rossberg commented 8 months ago

@tlively, I think the problem is that the trap handler has no access to the value that was attempted to be written.

tlively commented 8 months ago

I was assuming that the written value would be easier to provide to the trap handler than the previous in-memory value, but maybe it's still too complicated / slow to arrange for that.

cfallin commented 8 months ago

At least in Wasmtime, it's conceivable that we could include enough information in the trap metadata to replay the store byte-by-byte without having to include a full instruction disassembler in the segfault handler, starting from the register state. That would allow a "zero-cost" approach (in the same sense as for exceptions) where there's no action to save the store data on the common path. I think we'd still prefer the fixed-value (zeroes) approach to avoid that complexity if it's palatable, though I do understand that "in-bound part of the store succeeds" is much cleaner semantics.
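i.e. something along these lines once the faulting value has been recovered (illustrative; `value_bytes` is the little-endian encoding of the value the Wasm store would have written):

```rust
/// Illustrative trap-path fixup for the "in-bounds part of the store succeeds"
/// semantics: replay just the in-bounds prefix of the store, then raise the
/// Wasm trap as usual.
fn replay_inbounds_prefix(memory: &mut [u8], addr: usize, value_bytes: &[u8]) {
    let end = memory.len().min(addr.saturating_add(value_bytes.len()));
    if addr < end {
        memory[addr..end].copy_from_slice(&value_bytes[..end - addr]);
    }
}
```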

rossberg commented 8 months ago

@cfallin, yeah, I can believe that there is a plausible implementation strategy for some engines on some hardware platforms, but as long as it can't easily be done on all relevant hardware platforms, it would seem unwise to prescribe this semantics.

titzer commented 8 months ago

@cfallin AFAICT it's possible to do this without metadata by decoding the faulting machine store instruction and getting the supplied value from either an immediate or a register (maybe memory, if it's an RMW, but that probably won't work if it's an RMW of the same memory location, since per the discussion thread the contents of the memory location are now unspecified on Arm). Also, given that many different kinds of store instructions could trap, the instruction decoder for that could be quite unwieldy.

Another option is to use metadata to encode the original (bytecode) PC and then replay the bytecode in an interpreter or JIT a stub. Those are probably out of wasmtime's design criteria, though. Although, given that there is probably a finite number of combinations of stores + regs, precompiling all such stubs might be possible.