Threads and Atomics in WebAssembly
https://webassembly.github.io/threads/

Resizing details / underspecification #26

Open lars-t-hansen opened 7 years ago

lars-t-hansen commented 7 years ago

The section on resizing is underspecified. Some questions that come to mind:

@jfbastien also noted that the ES2017 atomics algorithms may or may not be correct if the SAB's length or the buffer's length can be updated in the middle of an operation. More generally, he pointed out that even if no implementation will implement growable SABs by moving the memory, we need to spell out what happens if a resize races with a read or a write (or we need to spec "as if memory is never moved" for the MVP, which seems OK).

binji commented 7 years ago
  • If agent A calls grow_memory, and lets agent B know that it has done so by writing something into shared memory (maybe racily), can agent B when it sees that write then assume that it can access memory up to the new limit, or must agent B do something special (eg call current_memory) before it can observe the new size?

I think it is reasonable to assume that the length is shared and updated atomically inside the wasm module, so there shouldn't need to be a synchronization step. But we need to be careful that this doesn't paint us into a corner in terms of implementation. For example: how do we deal with this if the engine uses bounds checks baked into the code for memory accesses? I guess the simplest thing to do is have a two-step bounds check -- if the check fails, then read the shared length and update the code.
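
For concreteness, a minimal sketch of the two-step check (all types and names are hypothetical, with C++ standing in for engine internals):

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical shared state: the authoritative length, updated atomically
// by grow_memory and visible to all agents.
struct SharedLength {
  std::atomic<std::size_t> bytes;
};

// Hypothetical per-agent state: the stale bound baked into (or cached
// alongside) this agent's generated code.
struct AgentState {
  std::size_t cachedBound;
  SharedLength* shared;
};

// Two-step bounds check: the fast path uses the cached bound; on failure,
// re-read the shared length, refresh the cache, and re-check.
// (Overflow checks omitted for brevity.)
inline bool boundsCheck(AgentState& a, std::size_t addr, std::size_t size) {
  if (addr + size <= a.cachedBound)
    return true;  // fast path: the stale bound is good enough
  std::size_t len = a.shared->bytes.load(std::memory_order_acquire);
  if (addr + size <= len) {
    a.cachedBound = len;  // lazily "update the code"
    return true;
  }
  return false;  // genuinely out of bounds: trap
}
```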

  • Is agent A's growing of the memory reflected in the length fields of the SABs in all the agents? If so, is reading that length field an atomically ordered operation? Otherwise, is there a way to obtain new SABs that reflect that length?

I think SABs should not have their lengths change, and the only way to get an SAB with the new length is via WebAssembly.Memory.buffer. Not sure if that's prohibitive though.

  • When agent A and agent B both call grow_memory at the same time with non-equal arguments, what happens? Cumulative growth seems most reasonable.

Agreed, cumulative growth is the only thing that makes sense to me here.

@jfbastien also noted that ES2017 algorithms for atomics may or may not be correct if the SAB length or the buffer's length can be updated in the middle of an operation...

I believe this check always happens in ValidateAtomicAccess, which requires a typed array, not a SAB. So even if we decide that the SAB length changes, the typed array length won't change in the middle of an operation. Of course this pushes the question to the typed array constructor and anything else that reads [[ArrayBufferByteLength]]. Looks like that is just the get SharedArrayBuffer.prototype.byteLength accessor, SharedArrayBuffer.prototype.slice, and the DataView constructor. I don't think we have a way to say that the implementation should atomically read this property, though -- seems best to just step around it by saying it doesn't change for a given SAB.

lars-t-hansen commented 6 years ago

I'm in the process of implementing this (not quite done yet but close enough) and here's the design I've ended up with. The SAB object holds the length of the underlying memory buffer, but the buffer itself also has an atomic length field.

Wasm ignores the SAB object. current_memory queries the length of the underlying buffer. grow_memory acquires a lock (associated with the buffer in my code, but it could be global) and, with the lock held, commits the additional pages and updates the buffer's length field. The new length is written at the very end of the grow process, after the pages are committed, so there is a window during which another thread can access memory beyond a current_memory value it previously read without trapping. On the other hand, if an access does not trap, then a subsequent read of current_memory will never reveal a value lower than the non-trapping address. I think this is the best we can hope for.
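
The shape of that design, as a sketch (commitPages stands in for the OS-specific commit step; everything here is hypothetical):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

// Hypothetical buffer: address space is reserved up to max at creation,
// so growing never moves the buffer, it only commits more pages.
struct SharedBuffer {
  std::atomic<std::size_t> length;  // bytes; readable without the lock
  std::mutex growLock;              // serializes growers
};

// Stand-in for the OS-specific commit step (mprotect / VirtualAlloc).
void commitPages(SharedBuffer& buf, std::size_t from, std::size_t to) {
  // ... flip the pages in [from, to) from reserved to committed ...
}

// grow_memory: commit first, publish the new length last, all under the
// lock. Concurrent grows serialize, so growth is cumulative. (The check
// against the maximum size is omitted for brevity.)
std::size_t growMemory(SharedBuffer& buf, std::size_t deltaBytes) {
  std::lock_guard<std::mutex> guard(buf.growLock);
  std::size_t oldLen = buf.length.load(std::memory_order_relaxed);
  commitPages(buf, oldLen, oldLen + deltaBytes);
  buf.length.store(oldLen + deltaBytes, std::memory_order_release);
  return oldLen;
}
```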

When the SAB's length is read, only the SAB's own field is queried, and this field never changes. When we read the Memory object's buffer, we check whether the SAB's length equals the buffer's length; if they are not equal, a new SAB object (caching the new length) is created, stored in the Memory object, and returned. Otherwise the existing SAB object stored in the Memory object is returned.
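
In outline (reusing the hypothetical SharedBuffer from the sketch above; the names are made up):

```cpp
// The cached SAB's length is frozen at creation; the buffer accessor
// mints a new SAB only when the underlying buffer has grown past it.
struct SabObject {
  std::size_t cachedLength;  // never changes for this SAB's lifetime
};

struct MemoryObject {
  SharedBuffer* buffer;
  SabObject* cachedSab;  // starts as a SAB caching the initial length
};

SabObject* getBufferAccessor(MemoryObject& mem) {
  std::size_t len = mem.buffer->length.load(std::memory_order_acquire);
  if (mem.cachedSab->cachedLength != len)
    mem.cachedSab = new SabObject{len};  // old SAB is GC'd in a real engine
  return mem.cachedSab;
}
```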

jfbastien commented 6 years ago

@lars-t-hansen does your implementation ever need to move the buffer to grow?

lars-t-hansen commented 6 years ago

No. For shared memory we reserve the memory up to the max (or to the usual 6GB range on x64) and just commit as the memory grows.

jfbastien commented 6 years ago

I think the final design should allow for implementations that require moving the memory, as well as for those that never move it. I'm not 100% sure what that would change, though.

lars-t-hansen commented 6 years ago

OK. After discussing this with Luke I'm coming around to the idea that even reading the SAB's length field should acquire the lock; this will tighten things up in the non-moving case. After that, it's likely that the signal handler will have to deal with the case where we observe a protection fault while some other thread is in the process of growing the memory, but this may just amount to retrying the operation while the lock is held by the other thread.

binji commented 6 years ago

I think the final design should allow for implementations that require moving the memory, as well as for those that never move.

It seems as if many designs can work (including this one) if the VM stops the world to grow memory. Are you talking about something more clever than this?

... even reading the SAB's length field should acquire the lock...

Why? What you suggested previously made more sense to me, and was how I imagined it would be implemented.

jfbastien commented 6 years ago

After that it's likely that the signal handler will have to deal with the case where we observe a protection fault while some other thread is in the process of growing the memory, but this may just amount to retrying the operation while the lock is held by some other thread.

Would the implementation be required to install a signal handler? I don't think this is something we want to mandate. It doesn't seem to be the case, but maybe I'm missing something subtle?

lars-t-hansen commented 6 years ago

@jfbastien, I don't think a signal handler would be required in any sense. Everything can be done with explicit bounds checks if necessary, and as Ben says, if other threads are parked when one thread is growing memory there are no tricky situations.

@binji, With the caveat that this code has not been written yet, I'll update as I dig into this more:

Suppose the memory buffer's length field is simply atomic and can be read without acquiring a lock, but setting it as part of the grow operation requires taking a lock. Now the length field can be set (within grow_memory) early, before committing the new pages, or it can be set late, after committing.

Suppose agent A is growing memory while agent B is accessing memory and reading current_memory.

Suppose the length field is set early. Now even if B is careful to first query current_memory and then access only within the bounds it returned, B can find that what it thought was a legal access traps (because the length was updated early but the pages have not been committed yet).

Suppose the length field is set late. Now B can perform an access that does not trap, then query current_memory and observe that the access should have trapped (because the length has not been updated yet).

We can decide to allow both of these situations to occur (reading current_memory during a grow_memory is not expected to be stable, so buyer beware), or we can settle on just one of them (to reduce inter-implementation variability), but it would be better if we could avoid the problem altogether (to remove inter-implementation variability entirely).

The first situation above can be avoided by having the signal handler / bounds-check trap handler observe that we are in the process of committing memory, and wait or retry until the commit is finished. This amounts to the handler being aware of the lock that grow_memory takes when it queries the length of memory to obtain the bound.

The second situation can be avoided if the call to current_memory acquires the lock, because it will then wait for the length to be updated.

In summary, having the committing of pages and the setting of the length appear as one atomic operation is probably best, and doing so means that reading current_memory must block on any in-progress commit finishing.
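
Concretely (reusing the hypothetical SharedBuffer from earlier): if current_memory takes the same lock as grow_memory, a reader blocks until any in-progress commit has finished and the new length has been published, so the commit and the length update appear as one atomic step.

```cpp
// current_memory under the grow lock: a value it returns is never "early"
// (pages not yet committed) and never "late" (pages committed but length
// not yet published).
std::size_t currentMemory(SharedBuffer& buf) {
  std::lock_guard<std::mutex> guard(buf.growLock);
  return buf.length.load(std::memory_order_relaxed);
}
```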

binji commented 6 years ago

OK, I see what you're getting at. The part that confused me was that you said the "SAB's length field" which made me think you were talking about the JS object's field, which shouldn't be changing as a result of grow_memory at all.

lars-t-hansen commented 6 years ago

Sorry about that. Sometimes I even confuse myself :)

Thinking about this a little more, consider an implementation that has explicit bounds checks and a non-moving buffer where the memory is pre-committed up to the max bound (no VM tricks, no signals). Except in the case where a single atomic cell holds both the buffer's length and the bounds check limit for all agents, there will be several memory locations to update when the buffer grows, and unless these updates appear atomic to all agents there is room for inconsistent views.

binji commented 6 years ago

Yes, though I think you could do this all lazily. Not sure if it would be worthwhile, but it may allow you to avoid stopping all threads. When one thread calls grow_memory, it atomically updates the memory buffer's length. (BTW, I'm assuming the explicit bounds checks are baked in, but I think this works even if they're not.) If the bounds check fails, then you atomically load the memory buffer's length and see if the bounds check still fails. If not, then you lazily update all the bounds checks. I believe it's OK even if this is racy, as the bounds checks are just a best guess at the current limit, and never relied on to be the source of truth.

AndrewScheidecker commented 5 years ago

In summary, having the committing of pages and the setting of the length appear as one atomic operation is probably the best

I think this is only meaningful if memory size must monotonically increase. In the scenario you described,

Now B can perform an access that does not trap, and then query current_memory, and observe that the access should have trapped (because length has not been updated yet).

if the memory size can also shrink, then querying memory.size after another memory access does not give you an upper bound for the memory size during the preceding memory access. So you can't infer that you've observed a partially committed memory resize: you would observe the same thing if another thread shrank the memory between the memory access and the memory.size query.

And if you add memory.protect, then you can't distinguish somebody randomly memory.protecting pages at the end of the memory from a partially committed grow or shrink.

conrad-watt commented 5 years ago

I think there are two concerns here that can be distinguished. It's true that, thread-locally, there is no ahead-of-time test that can guarantee future writeability of a location in the presence of hypothetical arbitrary protect or shrink. I think we just have to accept this, and I agree that the trapping behaviour in this situation doesn't tell you anything about what combination of grow/protect/shrink was enacted by the environment, no matter how atomic we make them.

The other concern is facilitating whole-program/execution reasoning about the behaviour of memory. If I'm not importing an arbitrary shared memory from an arbitrary environment, but instead I'm a compiler for a whole program, or working in an environment that I control or have some invariants/assumptions about, then I know what grow/protect/shrink operations other threads could be carrying out. In this case, the specification has a choice about the space of interleavings/whole program executions it allows. Abstractly making memory.grow entirely atomic reduces this space of possibilities, and disallows some astonishing executions that would occur otherwise. If it can be done without a performance hit to the non-trapping case (deferring bounds-checking/lock observation to the trap handler/failure case), I think this is valuable. If this creates real implementation pain, it could be revisited.

AndrewScheidecker commented 5 years ago

Programs will need to avoid races between memory accesses and resizing or changes in protection. A program that is racing a memory access with a memory.grow is probably incorrect, regardless of whether the memory.grow executes atomically or not.

And since it's easier to start with a non-deterministic spec and later add determinism than to go the other way, it's better to err on the side of non-determinism unless there's a clear need for determinism.

I'm also not sure there's a simple technique to implement atomic resizing with acceptable performance (i.e. better than taking a contended lock for every memory access). It seems like the two techniques that will be fast for non-resizing code are for the resizing thread to pause all other threads that might access the memory, or to defer locking until the signal handler and try to hide traps that would expose inconsistent states. Are there any other techniques?

Here's a straw-man decomposition of memory.grow (i32.const n) as multiple atomic events:

| Event | Actions |
| --- | --- |
| 0 | Lock the memory's resizeAndProtectMutex. |
| 1..n | Commit each page from memory.numPages to memory.numPages+n-1 as a separate event, in non-deterministic order. |
| n+1 | Set memory.numPages to memory.numPages+n, and unlock the memory's resizeAndProtectMutex. |

conrad-watt commented 5 years ago

I'm going off @lars-t-hansen's conclusion above that this can be implemented without impacting the non-trap case. I don't think any of the solutions described above require a lock on every access, which I agree would not be acceptable. Another fast technique, as discussed above, would be to pre-allocate up to the max memory bound and have a location tracking the current length.

I actually wonder if the bounds check in this case could always be a bare load (avoiding @binji's trick), so long as the length update at the point of the grow is done as a (C++-style) atomic RMW/CAS with a barrier, although I'd need to examine the architectural memory models in detail to be sure. EDIT: works on x86, but I don't think it does on ARM without tweaks.
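
One way to read this in C++ terms, purely as an illustration of the memory-ordering shape (per the EDIT, the reasoning holds on x86 but not on ARM without tweaks; the grow side is still assumed to commit pages under its lock before publishing):

```cpp
#include <atomic>
#include <cstddef>

// Grow side: publish the new length with a full-barrier RMW, after the
// pages have been committed.
void publishLength(std::atomic<std::size_t>& length, std::size_t newLen) {
  length.exchange(newLen, std::memory_order_seq_cst);  // RMW with barrier
}

// Access side: the bounds check pays only a bare (relaxed) load, with no
// barrier on the hot path.
bool inBounds(const std::atomic<std::size_t>& length,
              std::size_t addr, std::size_t size) {
  return addr + size <= length.load(std::memory_order_relaxed);
}
```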

Maximal non-determinism is only an unqualified good for implementers. For users trying to reason about and debug their concurrent code, and for developers of analysis tools, there are competing factors in favour of a general philosophy of determinism, even (especially?) in the case of "incorrect" code. Even for Wasm-targeting compilers, determinism can make it easier and less expensive for higher-level languages to target Wasm constructs without a semantic mismatch. Wasm in general has favoured determinism unless it imposes an unreasonable burden on implementers (like NaN bit-patterns).

EDIT: I can't believe I forgot to mention this, but the current debacle with C++11 relaxed atomics is a perfect example of how pushing for maximal non-determinism can compromise a language in unforeseen ways. Any potential fix would be "adding determinism", but the semantics can't be easily changed because people now rely on the performance characteristics.

I agree that concurrency in particular is a hotbed of inevitable non-determinism, but it seems like in this particular situation there is a reasonable deterministic (at the instruction level) and atomic semantics that people who have discussed this previously don't seem to mind implementing. @lars-t-hansen, has anything come up since the original discussion that causes problems for the atomic approaches sketched above?

conrad-watt commented 5 years ago

Having slept on this, I think I need to moderate some of the fervour of my last comment. If we need to weaken the semantics of racy grows, I think the sensible thing would be to simply make it entirely non-deterministic whether a racing read/write in the grow range succeeds or traps during the "commit period". The semantic story in this case would be similar to racy non-atomics in that there's a defined semantics, but, as you said, sensible programs should be avoiding the race altogether. There aren't many guarantees in the space between "atomic" and "entirely non-deterministic" that I could see a program/analysis usefully using.

@AndrewScheidecker I'm guessing that some of your suggestions here are prompted by your implementation experiences with WAVM. With the semantics you suggested/the one here, does anything change in the non-trap case performance/implementation-wise, or is this purely about the difficulty in getting the trap case to respect atomicity/consistency?

lars-t-hansen commented 5 years ago

What's currently implemented in Firefox (and available in Nightly behind the javascript.options.shared_memory flag in about:config) is roughly what I outlined before: we take a lock for memory.grow and also for memory.size, and the SAB returned via the buffer accessor has a never-changing length as far as JS is concerned, though the next call to buffer may return a different SAB with a larger length.

What we've not implemented is fixing anything up in the signal handler: it is possible for a thread to take an OOB trap because it races with a grow and then query current_memory to find out that the memory is large enough after all, or to query current_memory and access beyond that limit and not trap. But that's OK for now, since there's a race, and the program's view of the race is consistent once it knows that a race is possible.

However: we do not park the threads, and we use a single mprotect call to grow the memory, so we have no control over the order in which pages are committed or whether these pages become visible to all threads in the same order or at the same time. If the OS does not play along, the program can observe behavior that appears inconsistent with the simple (if racy) semantics of grow.

Of course, all of this is purely about linearly growing memory; there's nothing here for mmap, mprotect, shrinking, or similar.

conrad-watt commented 5 years ago

What we've not implemented is fixing anything up in the signal handler: it is possible for a thread to take an OOB trap because it races with a grow and then query current_memory to find out that the memory is large enough after all, or to query current_memory and access beyond that limit and not trap. But that's OK for now, since there's a race, and the program's view of the race is consistent once it knows that a race is possible.

So this seems to tie up the story regarding current_memory and an access.

The scenario I'm worried about (that I think you're referencing further in your comment) is where the size is indirectly observed. Imagine two sequenced accesses in one thread, guaranteed to be to the same page, racing with a grow_memory in another thread. If it's possible for the first access to succeed, but the second to trap with OOB, then grow_memory needs a weaker semantics. There are various flavours of this scenario with unordered/seqcst accesses. It sounds like in the implementation described above, even with seqcst accesses, this out-of-order trap observation is possible, depending on how the OS commits "real" pages under the hood (which may be smaller than Wasm pages).

AndrewScheidecker commented 5 years ago

I'm guessing that some of your suggestions here are prompted by your implementation experiences with WAVM.

Right. It sounds like WAVM is similar to Firefox with the exception that it just does an atomic load in memory.size instead of waiting for the mutex locked by memory.grow. I also implement shrinking a memory in a similar way to memory.grow, although it's only exposed as a host-side API instead of as a WASM instruction as well. I think adding that lock to memory.size does make sense: it provides more atomicity without much practical cost. The thing I'm less sure about is trying to make the mprotect call appear to be atomic.

WAVM already expects that it can use a signal handler/SEH to isolate out-of-bounds faults, so it's possible to add a lazy wait on the memory.grow mutex to the signal handler, and try to make the mprotect appear to be atomic. However, I'm concerned that it's hard to write tests for this, and very easy to implement incorrectly. It's a recipe for buggy implementations.
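
A sketch of that lazy wait, glossing over the async-signal-safety problems that make this delicate in practice (it reuses the hypothetical SharedBuffer from earlier; blocking on a mutex from a fault handler is exactly the kind of thing that's easy to get wrong):

```cpp
// On an out-of-bounds fault, take the grow mutex -- which waits out any
// in-flight mprotect -- then decide whether to retry the access or report
// a real trap.
enum class FaultAction { Retry, Trap };

FaultAction onOutOfBoundsFault(SharedBuffer& buf,
                               std::size_t addr, std::size_t size) {
  std::lock_guard<std::mutex> guard(buf.growLock);  // wait for grow to finish
  std::size_t len = buf.length.load(std::memory_order_relaxed);
  return addr + size <= len ? FaultAction::Retry : FaultAction::Trap;
}
```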

conrad-watt commented 5 years ago

WAVM already expects that it can use a signal handler/SEH to isolate out-of-bounds faults, so it's possible to add a lazy wait on the memory.grow mutex to the signal handler, and try to make the mprotect appear to be atomic. However, I'm concerned that it's hard to write tests for this, and very easy to implement incorrectly. It's a recipe for buggy implementations.

Yeah... and while it makes the abstract spec very neat, in such an implementation the user would potentially see executions with very strange perf characteristics, where morally a trap has occurred (and can occur, in other interleavings).

I'm warming to a semantics where an access racing with a grow_memory non-deterministically traps/succeeds. This wouldn't change the expected behaviour when observing the length directly (because subsequent accesses would no longer be racing), just for the multiple racing accesses case. It would moderately complicate the spec, but it seems justifiable here. Maybe this is something that could be flagged up for discussion at TPAC.

binji commented 5 years ago

Maybe this is something that could be flagged up for discussion at TPAC.

I've added this as an item to the agenda planning issue: https://github.com/WebAssembly/meetings/issues/282

conrad-watt commented 5 years ago

To keep track of things here: there was clear consensus at the TPAC meeting to move forward with the weaker semantics. This means that when multiple accesses in one thread race with another thread's memory.grow, and those accesses are in-bounds only after the grow commits, each may independently succeed or trap; the success of one does not affect the success or failure of any other. This is true even of atomic operations.

For clarity, nothing has changed about the semantics of memory.size (which guarantees no subsequent OOB up to the bound it observes, requiring the lock implementation described above or some other synchronization).