WebAssembly / threads

Threads and Atomics in WebAssembly
https://webassembly.github.io/threads/
Other
705 stars 50 forks source link

i64.atomic.wait has no matching futex syscall on Linux #135

Open programmerjake opened 5 years ago

programmerjake commented 5 years ago

I noticed that the i64.atomic.wait instruction can't be translated directly to the futex syscall on Linux, since Linux only supports 32-bit futex operations. I'm concerned that non-web embeddings on Linux will need to implement their own wait/notify functionality instead of being able to use Linux's directly.

Web embeddings will have to implement their own wait/notify functionality anyway to support primitives like Atomics.waitAsync

Related:

7

StackOverflow: Linux futex for 64-bits

conrad-watt commented 5 years ago

I believe, based on the discussion in https://github.com/WebAssembly/threads/issues/7, that it's possible to implement i64.atomic.wait as a 64-bit atomic read (to confirm the expected value), conditionally followed by a linux 32-bit futex operation on the same address, expecting the low 32-bits of the value read (modulo endianness).

programmerjake commented 5 years ago

that doesn't actually work, there can be a lost wakeup:

assuming little endian:

thread 1 thread 2
write 0x100000000
start i64.atomic.wait 0x100000000
read & compare
write 0x0
futex wake
futex wait
conrad-watt commented 5 years ago

Ah, true. Sorry for the rookie mistake.

lars-t-hansen commented 5 years ago

I don't think you can use raw futexes even for the 32-bit wait. The wait/notify mechanism has some guarantees, such as ordered wakeup, that raw futexes do not seem to provide.

binji commented 5 years ago

Right, I don't think it was ever the intention that you could directly replace these instructions with raw futex syscalls.

rianhunter commented 4 years ago

Another modest plea to drop memory.atomic.wait64 (f.k.a. i64.atomic.wait) for the sake of allowing a straightforward futex()-based implementation. An additional modest plea to ease the requirements of memory.atomic.wait32 to allow a futex()-based implementation, in particular allowing spurious wakeups. Some rationale:

Additionally, allowing spurious wakeups eases low-level implementations of memory.atomic.wait32 in general, since the sleep condition is only on the initial check of the atomic. Disallowing spurious wakeups makes more sense when the semantics of the wait call are continuously dependent on the value of the atomic, e.g. in C++20 atomic_wait(). Low-level implementations that have no way to prevent spurious wakeups are unable to decide if they should continue to sleep or not after one occurs (even if they weren't notified).

I believe the ordering issue can be resolved by placing manual seqcst fences around the futex calls, though syscalls already provide a memory barrier so that would be redundant. @lars-t-hansen could you elaborate what you mean by "ordered wakeup"?

tlively commented 4 years ago

I'm also curious about why we decided to disallow spurious wakeups. https://github.com/WebAssembly/design/pull/1019#discussion_r112070603 suggests it is because JS doesn't allow them, but I don't know why that is.

lars-t-hansen commented 4 years ago

Interop with JS Atomics doesn't require 64-bit wait

JS supports 64-bit TypedArray in wait and notify: https://tc39.es/ecma262/#sec-atomics.wait handles BigInt64Array, and notify is size-agnostic but allows that type of view.

It was never (except maybe the very early days, when we were starting with something similar to what I believe NaCl had) intended for futexes to be able to directly express the semantics of wait and notify.

The prohibition on spurious wakeups and other design artifacts are there to push complexity into the implementation and away from the user (after all this is/was JS). Sticking to seq_cst for the first version is similarly motivated. Simply adding shared memory and atomics to JS was highly controversial, and compromises that reduced nondeterminism at the cost of some performance were necessary.

The intent of the ordered wakeup rule is fairness and (again) managing nondeterminism: if an observer can determine that one thread blocked in a wait before another, then waking one thread on that location will for sure wake the first one.

lars-t-hansen commented 4 years ago

I should add: Compatibility with JS is a big deal for wasm on the web and since JS is still much more important than wasm, JS controls the agenda to some extent. But that fact does not prevent there from being additional instructions that work better in other environments. IMO we should be very open to proposals that add instructions that allow for better performance at the expense of nondeterminism across systems and implementations. But I do think those should be framed as new proposals.

The situation is similar to SIMD: At the moment, wasm SIMD has no nondeterminism and SIMD instructions work the same way (with some perf compromises) on different hardware platforms. But it is expected that the next version of wasm SIMD will have instructions that work (and are known to work) well on some hardware and less well on other hardware, or instructions that have variable amount of imprecision and therefore add nondeterminism, all in the interest of performance.

rianhunter commented 4 years ago

@lars-t-hansen your prompt response is appreciated and well understood. While it's unfortunate that the semantics of JS atomics wait/notify diverges from kernel-level wait/notify primitives for the sake of eliminating non-determinism, I understand the design constraint of maintaining semantic compatibility with JS.

@tlively Here's one example (of which you may or may not be aware) where disallowing spurious wakeups in the primitive comes in handy:

// thread 1
val = atomic_load_seqcst(&var);
if (!cond(val))
    atomic_wait(&var, val);
assert(cond(val));

// thread 2
set_condition_seqcst(&var);
atomic_notify(&var);

This avoids a recheck of val after the atomic_wait() in thread 1, which is slightly more efficient if the atomic_wait() primitive also efficiently avoids spurious wakeups. I think the tradeoff here is slightly dubious but that's only because I can't think of other examples where disallowing spurious wakeups simplifies user-level code. Would love to see more examples of algorithms that are much expressed much more succintly and/or efficiently with a deterministic/non-spurious wait, if only for my own edification.

tlively commented 4 years ago

@rianhunter I also depend on the absence of spurious wakes in the code that initializes memory when threads are enabled: https://github.com/llvm/llvm-project/blob/adcd02683856c30ba6f349279509acecd90063df/lld/wasm/Writer.cpp#L728-L768. That's not user-level code, but it would be expressible in C. However, that code could be trivially made to handle spurious wakeups by just waiting in a loop.

lars-t-hansen commented 4 years ago

In general I think that there are some warts in the current JS design. Disallowing spurious wakeups at the JS API level just pushes that checking into the implementation, and it's not obvious that it really simplifies user code all that much -- user code has to be aware of the complicated semantics of waking up in any case. (Though it's probable that fairness is a useful property and it's possible fairness and no-spurious-wakeup are connected.) It's been pointed out that the return value from notify is of questionable utility, if any (https://github.com/tc39/ecma262/issues/1492). And isLockFree() should almost certainly never have been part of the proposal (and is not in wasm).

rianhunter commented 4 years ago

Disallowing spurious wakeups at the JS API level just pushes that checking into the implementation, and it's not obvious that it really simplifies user code all that much

Just for clarity's sake, the reason to allow spurious wakeups in the primitive is that not all systems have an efficient way of preventing spurious wakeups. This is less applicable to big systems and more applicable to small low-level / embedded systems. The impact consideration on user code isn't that allowing spurious wakeups simplifies user code but rather that it doesn't significantly complicate user code (e.g. @tlively's example).

rodgert commented 2 years ago

Another modest plea to drop memory.atomic.wait64 (f.k.a. i64.atomic.wait) for the sake of allowing a straightforward futex()-based implementation. An additional modest plea to ease the requirements of memory.atomic.wait32 to allow a futex()-based implementation, in particular allowing spurious wakeups. Some rationale:

I am the implementer of libstdc++'s atomic wait. libstdc++ currently supports using futex() on platforms which have it, I also anticipate supporting platforms that have __ulock_wait/wake in a future release (possibly GCC13). For platforms which do not have such a mechanism, there is a fallback implementation built on mutex/condvar. For those types which fit into the underlying platform's wait/notify mechanism the futex() is used directly. If the atomic type T does not fit, it is proxied through another waited address. There are limited number of such addresses and they are shared, so there is an additional potential for spurious wakeups in this case.

The standard allows for spurious wakeups and there are no guarantees on ordering.

  • This will potentially improve performance for futex()-based non-web embeddings
  • 64-bit wait doesn't seem to be widespread in existing native code (C/C++)
  • Interop with JS Atomics doesn't require 64-bit wait

Additionally, allowing spurious wakeups eases low-level implementations of memory.atomic.wait32 in general, since the sleep condition is only on the initial check of the atomic. Disallowing spurious wakeups makes more sense when the semantics of the wait call are continuously dependent on the value of the atomic, e.g. in C++20 atomic_wait(). Low-level implementations that have no way to prevent spurious wakeups are unable to decide if they should continue to sleep or not after one occurs (even if they weren't notified).

I believe the ordering issue can be resolved by placing manual seqcst fences around the futex calls, though syscalls already provide a memory barrier so that would be redundant. @lars-t-hansen could you elaborate what you mean by "ordered wakeup"?