WebAssembly / shared-everything-threads

A draft proposal for spawning threads in WebAssembly
Other
29 stars 1 forks source link

there should be separate acquire and release orderings #60

Open programmerjake opened 1 month ago

programmerjake commented 1 month ago

the current proposal only has options for seqcst and acqrel but not acquire or release. This is less optimal for AArch64 and probably other common non-x86 ISAs, because it generally doesn't have a combined acqrel encoding, but only acquire, only release, or seqcst (it also has relaxed, but that can be left for a future change). Likewise fences on most non-x86 ISAs likely have an acquire/release fence be the exact same instruction as a seqcst fence, making it much more expensive than an acquire or release fence.

AArch64 example: https://rust.godbolt.org/z/8on78ceKd RISC-V example: https://rust.godbolt.org/z/vE3Wfb3sM

notice how the seqcst and acqrel instructions are identical in the output assembly, but the acquire and/or release instructions are different:

example::SeqCst::fence::hbe3d52b8604a8791:
        dmb     ish
        ret

example::AcqRel::fence::hd84bc02b16096cfe:
        dmb     ish
        ret

example::Acquire::fence::h679889b32ada7438:
        dmb     ishld
        ret

example::Release::fence::h7d0e3ea44ef10df1:
        dmb     ish
        ret

example::SeqCst::fetch_add::ha419496bae44cd3a:
        ldaddal w1, w0, [x0]
        ret

example::AcqRel::fetch_add::h5591e17ade7f755b:
        ldaddal w1, w0, [x0]
        ret

example::Acquire::fetch_add::hbceb98eee686a1c4:
        ldadda  w1, w0, [x0]
        ret

example::Release::fetch_add::h783c91683a83f0d0:
        ldaddl  w1, w0, [x0]
        ret
tlively commented 1 month ago

Thanks! Yes, it looks like we need separate Acquire and Release orderings for use with the RMW ops. We're currently taking two bits of the alignment to encode memory orders, so if we use them to encode SeqCst, Acquire, Release, and AcqRel orderings, we won't have room for future extension.

A more cautious approach would be to use a single bit in the alignment field to signify that a separate ordering immediate follows, similar to how multi-memory uses a bit to signal that a memory index follows.

Alternatively, we might be able to give the RMW ops separate immediates for their load and store orderings like we have to do with cmpxchg (see #59). That would let us use a single encoding for Release, Acquire, and AcqRel like the explainer describes today.

@conrad-watt, how likely do you think it is that in the fullness of time we will need more than SeqCst, Acquire, Release, and AcqRel orderings? Will we ever need Relaxed, for example?

Right now I lean toward the solution where we use a single bit to introduce separate ordering immediates.

conrad-watt commented 1 month ago

An RMW like cmpxchg always conceptually has two ordering parameters - one associated with the load and one associated with the store. In the case that it appears to be "only" (load) acquire or "only" (store) release, the other parameter is effectively relaxed (store/load respectively), so until we have any ordering in the spec weaker than acquire/release, the weakest RMW we can provide is acquire-release. Technically we could allow acquire-seqcst and seqcst-release as separate orderings, but these don't seem to be used in practice, would be a pain to spec (and probably to handle in the engine) as separate cases, and also IIRC only compile slightly differently on older ARMv7. It's ok for us to think about an encoding with 2 ordering parameters under the hood (remembering that load release and store acquire are not valid / equivalent to relaxed) - I wasn't thinking in detail about this before this issue. If we go with a 2-parameter encoding, I would advocate for explicitly disallowing acquire-seqcst and seqcst-release as a conservative/complexity-saving measure.

WRT fences, it's mostly harmless to add acquire, release variants in addition to acquire-release for slightly more efficient codegen. I'm actually a little surprised that the release fence in the OP isn't compiling to dmb ishst, but the reason why is explained here (https://gcc.gnu.org/legacy-ml/gcc-patches/2014-03/msg00570.html). Fences really do just have one parameter, there is no such thing as an acquire-seqcst fence.

To briefly give some wider hot takes, I would currently be against adding relaxed ordering because it's caused such a mess in C++-land. If we want to go weaker that acquire-release in future there are intermediate alternatives like "load-store ordered" atomics with very close performance to relaxed (edit: link). I'd also be against weak RMWs because of their non-determinism (in some models, even in single-threaded code).

programmerjake commented 1 month ago

imo there should be the option for relaxed ordering or some close equivalent, as long as it's correct for a compiler to compile C/C++/Rust relaxed to that ordering, and as long as LLVM supports that ordering (since there are popular wasm engines that use LLVM).

to be perfectly honest, I would just use relaxed since that means that the conversion from C/C++ to wasm and from wasm to C can be lossless. iirc new C versions may be adjusting their version of relaxed anyway...

conrad-watt commented 1 month ago

iirc new C versions may be adjusting their version of relaxed anyway...

Well this is an argument against us adding relaxed now - we don't want to tailor ourselves to one definition that then changes underfoot afterwards.

imo there should be the option for relaxed ordering or some close equivalent, as long as it's correct for a compiler to compile C/C++/Rust relaxed to that ordering, and as long as LLVM supports that ordering (since there are popular wasm engines that use LLVM).

Personally I see load-store ordered atomics as (mostly) satisfying the above (while staying correct however C/C++ changes relaxed in future), but we've deferred some of the details of this debate while we work out acquire-release and big picture details like TLS.

conrad-watt commented 1 month ago

I should explicitly bring up that load-store ordered atomics are essentially "RC11 relaxed atomics", which have the strongest (in the sense of least relaxed) candidate semantics anyone has suggested as an alternative to the current state of C11 relaxed atomics.

programmerjake commented 1 month ago

iirc new C versions may be adjusting their version of relaxed anyway...

Well this is an argument against us adding relaxed now - we don't want to tailor ourselves to one definition that then changes underfoot afterwards.

in that case, I think it may be best to just encode one bit in the alignment field that means there's a following ordering byte, and the top nibble encodes one ordering and the bottom nibble encodes the other ordering (for cmpxchg which needs two). this way we have space for seqcst, acqrel, acquire, release, relaxed, consume (if that ever gets fixed), and any new fancy orderings that future C versions might add. 16 possible orderings ought to be enough for anyone...

conrad-watt commented 1 month ago

Yes, I agree that we should make sure we're leaving sufficient space in the encoding even if we can't fix all our choices now. The nibble idea is interesting but I'll have to defer to people who think about the binary format more! (@tlively?)

rossberg commented 1 month ago

While I agree it's best to leave some extension space, we should also do everything we can for Wasm not to grow a zoo of orderings in the future. In the context of its cross-platform nature, it is even more difficult to get any reliable performance benefit out of weaker semantics anyway.

tlively commented 1 month ago

Ok, so it sounds like we want separate encodings of acquire and release at least for use with fences. Packing two 4-bit ordering encodings into a single byte sounds fine to me; instructions that take just one ordering can use only the low 4 bits.

For now we will have seqcst, acquire, and release, orderings. Using release for a load ordering or acquire for a store ordering will not be allowed. I am happy to disallow acquire-seqcst and seqcst-release as well.

I'll make a PR.

programmerjake commented 1 month ago

for things like fetch_add that traditionally take only one ordering, we'll also want an acq-rel ordering, unless the plan is to make it take two orderings?

tlively commented 1 month ago

I think we can consistently have every RMW op take two orderings (e.g. acquire+release) and add restrictions so that each instruction can only get the orderings that make sense for it.

tlively commented 1 month ago

Oh, but cmpxchg is special because its two orderings are for the success and fail cases, not for the read and write. If we always have separate ordering immediates for reads and writes, then it would seem that cmpxchg should have four orderings. Yikes!

I think we should get clarity here by listing out all the combinations of orderings we actually want to support for 1) cmpxchg and 2) other RMW ops. Once we have the full list, we can consider the best way to encode them. @conrad-watt, would you be able to help out with this?

conrad-watt commented 1 month ago

Oh, but cmpxchg is special because its two orderings are for the success and fail cases, not for the read and write. If we always have separate ordering immediates for reads and writes, then it would seem that cmpxchg should have four orderings. Yikes!

Actually, cmpxchg can also fit into the "read and write" view of things, and I believe this is closer to how hardware has to do things with paired exclusive instructions. I'm not sure why C/C++ does things by "failure/success" case - I think when actually compiling to a hardware sequence they'd need to take the stronger of the two and use this as both the "read" and "write" component of the compilation scheme. I'll shoot an email to someone who's more versed in C/C++.

If we're individually enumerating all the cases we care about, the most important are "acquire-release everywhere", and "seqcst everywhere". There are some niche use-cases for mixed-consistency accesses, but mostly for "acquire-release or weaker", and since we're not offering anything weaker than acquire-release, we can defer thinking about this so long as there's forward-compatible space in the encoding. The compilation scheme stays sound anyway because any missing "mixed" accesses can just be bumped up to the next-strongest "uniform" access.

So I'd propose that for fences we support 0x00 - seqcst 0x01 - acquire 0x02 - release 0x03 - acquire-release

It's nice to establish the precedent that acquire-release is encoded as acquire AND release. Also, "acquire fence" is used in practice and compiles differently from an "acquire-release fence", so we get immediate value from having the distinction.

for instructions that could take two orderings (RMWs including xchg and cmpxchg), for now we just support 0x00 - seqcst everywhere 0x33 - acquire-release everywhere

This lets us defer the question of exactly how to interpret cmpxchg vs other RMW ops while getting 99% value. Until we have an ordering weaker than acquire/release, it's hard to do better anyway.

For instructions that take one ordering, we just support 0x00 - seqcst 0x03 - acquire-release

In theory we could allow 0x01 and 0x02, but this seems to create some fussy situations - e.g. if a load is marked with 0x02, there's no such thing as a "release load", so should this be an error?

Side-note - for fences and single-order ops, we could choose to duplicate the order parameter - e.g. 0x33 for acquire-release instead of 0x03, but this might lose us flexibility if we later want to use the top nibble for some other flags we're not anticipating.

programmerjake commented 1 month ago

for instructions that could take two orderings (RMWs including xchg and cmpxchg), for now we just support 0x00 - seqcst everywhere 0x33 - acquire-release everywhere

This lets us defer the question of exactly how to interpret cmpxchg vs other RMW ops while getting 99% value. Until we have an ordering weaker than acquire/release, it's hard to do better anyway.

in my experience atomic RMW ops with only acquire or only release are more common than fences. Remember that on several notable ISAs (AArch64 iirc and RISC-V), an acquire-release RMW is usually the exact same instruction as the corresponding seqcst RMW, so we lose most of the benefit here. You can't even use a fence combined with RMW to make an acquire or release RMW, since we don't yet have relaxed RMWs!

programmerjake commented 1 month ago

This lets us defer the question of exactly how to interpret cmpxchg vs other RMW ops while getting 99% value. Until we have an ordering weaker than acquire/release, it's hard to do better anyway.

I see less of a problem with just saying we don't yet know how to handle acquire or release for cmpxchg, so only seqcst or acquire/release are supported for now -- to be fixed very soon. This allows us to still get the benefit for all other RMW ops without having to wait

conrad-watt commented 1 month ago

I see less of a problem with just saying we don't yet know how to handle acquire or release for cmpxchg, so only seqcst or acquire/release are supported for now -- to be fixed very soon. This allows us to still get the benefit for all other RMW ops without having to wait

The fundamental issue isn't unique to cmpxchg - this proposal currently isn't considering any memory ordering weaker than acquire/release, so we can't spec any RMW op with a relaxed read/write component (this is what an RMW that is "just" acquire or "just" release" is implicitly doing).

This is something we can revisit once we've solved our more fundamental design problems around TLS, shared functions, and JS interaction. Right now the Web only offers SeqCst atomics, so acquire/release is convenient as a relatively uncontroversial way for us to (1) establish a precedent that relaxed memory orderings can be supported on the Web and (2) ensure there's space in the encoding for these orderings. I see it as a foot in the door for the more challenging conversations that we'll need to have later.