Open programmerjake opened 1 month ago
Thanks! Yes, it looks like we need separate Acquire and Release orderings for use with the RMW ops. We're currently taking two bits of the alignment to encode memory orders, so if we use them to encode SeqCst, Acquire, Release, and AcqRel orderings, we won't have room for future extension.
A more cautious approach would be to use a single bit in the alignment field to signify that a separate ordering immediate follows, similar to how multi-memory uses a bit to signal that a memory index follows.
Alternatively, we might be able to give the RMW ops separate immediates for their load and store orderings like we have to do with cmpxchg (see #59). That would let us use a single encoding for Release, Acquire, and AcqRel like the explainer describes today.
@conrad-watt, how likely do you think it is that in the fullness of time we will need more than SeqCst, Acquire, Release, and AcqRel orderings? Will we ever need Relaxed, for example?
Right now I lean toward the solution where we use a single bit to introduce separate ordering immediates.
An RMW like cmpxchg
always conceptually has two ordering parameters - one associated with the load and one associated with the store. In the case that it appears to be "only" (load) acquire
or "only" (store) release
, the other parameter is effectively relaxed
(store/load respectively), so until we have any ordering in the spec weaker than acquire
/release
, the weakest RMW we can provide is acquire-release
. Technically we could allow acquire-seqcst
and seqcst-release
as separate orderings, but these don't seem to be used in practice, would be a pain to spec (and probably to handle in the engine) as separate cases, and also IIRC only compile slightly differently on older ARMv7. It's ok for us to think about an encoding with 2 ordering parameters under the hood (remembering that load release
and store acquire
are not valid / equivalent to relaxed) - I wasn't thinking in detail about this before this issue. If we go with a 2-parameter encoding, I would advocate for explicitly disallowing acquire-seqcst
and seqcst-release
as a conservative/complexity-saving measure.
WRT fences, it's mostly harmless to add acquire
, release
variants in addition to acquire-release
for slightly more efficient codegen. I'm actually a little surprised that the release
fence in the OP isn't compiling to dmb ishst
, but the reason why is explained here (https://gcc.gnu.org/legacy-ml/gcc-patches/2014-03/msg00570.html). Fences really do just have one parameter, there is no such thing as an acquire-seqcst
fence.
To briefly give some wider hot takes, I would currently be against adding relaxed
ordering because it's caused such a mess in C++-land. If we want to go weaker that acquire-release
in future there are intermediate alternatives like "load-store ordered" atomics with very close performance to relaxed
(edit: link). I'd also be against weak RMWs because of their non-determinism (in some models, even in single-threaded code).
imo there should be the option for relaxed ordering or some close equivalent, as long as it's correct for a compiler to compile C/C++/Rust relaxed to that ordering, and as long as LLVM supports that ordering (since there are popular wasm engines that use LLVM).
to be perfectly honest, I would just use relaxed since that means that the conversion from C/C++ to wasm and from wasm to C can be lossless. iirc new C versions may be adjusting their version of relaxed anyway...
iirc new C versions may be adjusting their version of relaxed anyway...
Well this is an argument against us adding relaxed now - we don't want to tailor ourselves to one definition that then changes underfoot afterwards.
imo there should be the option for relaxed ordering or some close equivalent, as long as it's correct for a compiler to compile C/C++/Rust relaxed to that ordering, and as long as LLVM supports that ordering (since there are popular wasm engines that use LLVM).
Personally I see load-store ordered atomics as (mostly) satisfying the above (while staying correct however C/C++ changes relaxed in future), but we've deferred some of the details of this debate while we work out acquire-release and big picture details like TLS.
I should explicitly bring up that load-store ordered atomics are essentially "RC11 relaxed atomics", which have the strongest (in the sense of least relaxed) candidate semantics anyone has suggested as an alternative to the current state of C11 relaxed atomics.
iirc new C versions may be adjusting their version of relaxed anyway...
Well this is an argument against us adding relaxed now - we don't want to tailor ourselves to one definition that then changes underfoot afterwards.
in that case, I think it may be best to just encode one bit in the alignment field that means there's a following ordering byte, and the top nibble encodes one ordering and the bottom nibble encodes the other ordering (for cmpxchg which needs two). this way we have space for seqcst, acqrel, acquire, release, relaxed, consume (if that ever gets fixed), and any new fancy orderings that future C versions might add. 16 possible orderings ought to be enough for anyone...
Yes, I agree that we should make sure we're leaving sufficient space in the encoding even if we can't fix all our choices now. The nibble idea is interesting but I'll have to defer to people who think about the binary format more! (@tlively?)
While I agree it's best to leave some extension space, we should also do everything we can for Wasm not to grow a zoo of orderings in the future. In the context of its cross-platform nature, it is even more difficult to get any reliable performance benefit out of weaker semantics anyway.
Ok, so it sounds like we want separate encodings of acquire
and release
at least for use with fences. Packing two 4-bit ordering encodings into a single byte sounds fine to me; instructions that take just one ordering can use only the low 4 bits.
For now we will have seqcst
, acquire
, and release
, orderings. Using release
for a load ordering or acquire
for a store ordering will not be allowed. I am happy to disallow acquire-seqcst
and seqcst-release
as well.
I'll make a PR.
for things like fetch_add that traditionally take only one ordering, we'll also want an acq-rel ordering, unless the plan is to make it take two orderings?
I think we can consistently have every RMW op take two orderings (e.g. acquire
+release
) and add restrictions so that each instruction can only get the orderings that make sense for it.
Oh, but cmpxchg
is special because its two orderings are for the success and fail cases, not for the read and write. If we always have separate ordering immediates for reads and writes, then it would seem that cmpxchg
should have four orderings. Yikes!
I think we should get clarity here by listing out all the combinations of orderings we actually want to support for 1) cmpxchg and 2) other RMW ops. Once we have the full list, we can consider the best way to encode them. @conrad-watt, would you be able to help out with this?
Oh, but
cmpxchg
is special because its two orderings are for the success and fail cases, not for the read and write. If we always have separate ordering immediates for reads and writes, then it would seem thatcmpxchg
should have four orderings. Yikes!
Actually, cmpxchg
can also fit into the "read and write" view of things, and I believe this is closer to how hardware has to do things with paired exclusive instructions. I'm not sure why C/C++ does things by "failure/success" case - I think when actually compiling to a hardware sequence they'd need to take the stronger of the two and use this as both the "read" and "write" component of the compilation scheme. I'll shoot an email to someone who's more versed in C/C++.
If we're individually enumerating all the cases we care about, the most important are "acquire-release everywhere", and "seqcst everywhere". There are some niche use-cases for mixed-consistency accesses, but mostly for "acquire-release or weaker", and since we're not offering anything weaker than acquire-release, we can defer thinking about this so long as there's forward-compatible space in the encoding. The compilation scheme stays sound anyway because any missing "mixed" accesses can just be bumped up to the next-strongest "uniform" access.
So I'd propose that for fences we support
0x00
- seqcst
0x01
- acquire
0x02
- release
0x03
- acquire-release
It's nice to establish the precedent that acquire-release is encoded as acquire AND release. Also, "acquire fence" is used in practice and compiles differently from an "acquire-release fence", so we get immediate value from having the distinction.
for instructions that could take two orderings (RMWs including xchg
and cmpxchg
), for now we just support
0x00
- seqcst everywhere
0x33
- acquire-release everywhere
This lets us defer the question of exactly how to interpret cmpxchg
vs other RMW ops while getting 99% value. Until we have an ordering weaker than acquire/release
, it's hard to do better anyway.
For instructions that take one ordering, we just support
0x00
- seqcst
0x03
- acquire-release
In theory we could allow 0x01
and 0x02
, but this seems to create some fussy situations - e.g. if a load is marked with 0x02
, there's no such thing as a "release load", so should this be an error?
Side-note - for fences and single-order ops, we could choose to duplicate the order parameter - e.g. 0x33
for acquire-release instead of 0x03
, but this might lose us flexibility if we later want to use the top nibble for some other flags we're not anticipating.
for instructions that could take two orderings (RMWs including
xchg
andcmpxchg
), for now we just support0x00
- seqcst everywhere0x33
- acquire-release everywhereThis lets us defer the question of exactly how to interpret
cmpxchg
vs other RMW ops while getting 99% value. Until we have an ordering weaker thanacquire/release
, it's hard to do better anyway.
in my experience atomic RMW ops with only acquire or only release are more common than fences. Remember that on several notable ISAs (AArch64 iirc and RISC-V), an acquire-release RMW is usually the exact same instruction as the corresponding seqcst RMW, so we lose most of the benefit here. You can't even use a fence combined with RMW to make an acquire or release RMW, since we don't yet have relaxed RMWs!
This lets us defer the question of exactly how to interpret
cmpxchg
vs other RMW ops while getting 99% value. Until we have an ordering weaker thanacquire/release
, it's hard to do better anyway.
I see less of a problem with just saying we don't yet know how to handle acquire or release for cmpxchg, so only seqcst or acquire/release are supported for now -- to be fixed very soon. This allows us to still get the benefit for all other RMW ops without having to wait
I see less of a problem with just saying we don't yet know how to handle acquire or release for cmpxchg, so only seqcst or acquire/release are supported for now -- to be fixed very soon. This allows us to still get the benefit for all other RMW ops without having to wait
The fundamental issue isn't unique to cmpxchg
- this proposal currently isn't considering any memory ordering weaker than acquire/release, so we can't spec any RMW op with a relaxed read/write component (this is what an RMW that is "just" acquire or "just" release" is implicitly doing).
This is something we can revisit once we've solved our more fundamental design problems around TLS, shared functions, and JS interaction. Right now the Web only offers SeqCst atomics, so acquire/release is convenient as a relatively uncontroversial way for us to (1) establish a precedent that relaxed memory orderings can be supported on the Web and (2) ensure there's space in the encoding for these orderings. I see it as a foot in the door for the more challenging conversations that we'll need to have later.
the current proposal only has options for seqcst and acqrel but not acquire or release. This is less optimal for AArch64 and probably other common non-x86 ISAs, because it generally doesn't have a combined acqrel encoding, but only acquire, only release, or seqcst (it also has relaxed, but that can be left for a future change). Likewise fences on most non-x86 ISAs likely have an acquire/release fence be the exact same instruction as a seqcst fence, making it much more expensive than an acquire or release fence.
AArch64 example: https://rust.godbolt.org/z/8on78ceKd RISC-V example: https://rust.godbolt.org/z/vE3Wfb3sM
notice how the seqcst and acqrel instructions are identical in the output assembly, but the acquire and/or release instructions are different: