WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/

What should be the semantics/constraints of hypothetical masked memory accesses? #13

Open lemaitre opened 4 years ago

lemaitre commented 4 years ago

I believe masked memory accesses are crucial, but I would like to separate two aspects: "Do we really need them?" and "How would they work?"

The goal of this issue is to address the question "How would they work?", assuming we want them. I leave the question "Do we really need them?" for issue #7.

There are 4 main operations: aligned masked load, unaligned masked load, aligned masked store, and unaligned masked store (detailed in the sections below).

There could also be variants for certain mask shapes, like loadOnlyN or storeOnlyN. Some of these are explored in this paper: http://www.cosenza.eu/papers/PohlSCOPES18.pdf

Aligned masked load

As long as the SIMD width is smaller than a page, it would be possible to just use a regular aligned load, and select only the active lanes specified by the mask.

No hardware support is required to make this operation efficient.
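
As an illustration (not part of the proposal), here is a minimal C++ sketch of this emulation using AVX2 intrinsics; the function name and the choice of 8x32-bit lanes are made up for this example:

#include <immintrin.h>
#include <cstdint>

// Sketch: emulate an aligned masked load of 8x32-bit lanes with a full aligned
// load followed by a blend. The aligned load cannot cross a page (assuming the
// vector is no larger than a page), so inactive lanes cannot cause a spurious fault.
// `mask` must have all bits set in active lanes and cleared in inactive ones.
static inline __m256i masked_load_aligned_i32x8(const int32_t* ptr, __m256i mask,
                                                __m256i fallback) {
    __m256i full = _mm256_load_si256((const __m256i*)ptr); // full aligned load
    return _mm256_blendv_epi8(fallback, full, mask);       // keep loaded values in active lanes only
}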

Unaligned masked load

Unaligned masked loads are a different beast. One may think it would be possible to do the same as for aligned masked loads, but the full (now unaligned) load might partially fault. This is the case when the load crosses a page boundary.

If the memory fault happens for active elements, there is no problem as the fault is legitimate, but the fault might also happen only on inactive elements. In that case, the fault should be ignored and the load should succeed.

There are 3 ways to work around this issue:

I think the signal handler would be the most efficient one as faults would be rare, even if masked loads are intensively used. This would require an internal signal handler, but I have the impression that WASM engines already do have signal handlers for internal use.

Such a signal handler would require a way to check if the load is actually masked, and in that case, where the mask is actually stored. This could be done either with a table of all unaligned masked loads of the program, or a specific nop sequence just before the load instruction.

The scalar emulation could be done in a library function to avoid code bloat.
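
For reference, a hedged sketch of what such a library routine could look like (the name, signature, and 32-bit lanes are illustrative):

#include <cstddef>
#include <cstdint>

// Scalar fallback: only active lanes touch memory, so inactive lanes can never fault,
// even when the full vector would cross into an unmapped page.
void masked_load_scalar_i32(int32_t* dst, const int32_t* src,
                            const uint8_t* mask, size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {
        if (mask[i]) {
            dst[i] = src[i]; // load only the active lane
        }
    }
}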

Aligned masked store

A masked store should leave the masked-out elements in memory unchanged (they are not modified).

This could be done in the following ways: a native masked store instruction, a "load-blend-store" sequence, or a scalar emulation.

Masked stores are, after all, pretty well supported on x86 architectures, and the emulation with tighter stores on AVX2 would be pretty close to a hypothetical hardware instruction. So the overhead on x86 would be minimal.
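
For concreteness, a minimal sketch of the native AVX2 path for 32-bit lanes (the wrapper name is made up for this example):

#include <immintrin.h>
#include <cstdint>

// vpmaskmovd: lanes whose mask MSB is clear are left untouched in memory,
// which matches the masked-store semantics described above.
static inline void masked_store_i32x8(int32_t* ptr, __m256i mask, __m256i value) {
    _mm256_maskstore_epi32((int*)ptr, mask, value); // stores only the active lanes
}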

The "load-blend-store" would also be quite efficient, but has a risk of race-condition if another thread (or an interrupt handler) stores elements at the position of inactive elements of the first thread. So this option should probably be avoided in mutlithreaded context. Anyway, such a situation would lead to false-sharing and would be detrimental for performance. To be noted that false-sharing is only a performance concern and does not affect correctness.

On Neon, the "load-blend-store" can be replaced by a "load acquire-blend-store conditional" loop to solve the race-condition issue.

Unaligned masked store

Unaligned masked stores are almost the same as aligned ones. The only difference is that a "load-blend-store" emulation would require the same signal-handler trick as unaligned masked loads.

loadOnlyN, storeOnlyN

In case we have scalar emulations of masked operations, such an emulation can be more efficient if the mask has a special shape like "the first N lanes are active, the remaining ones are inactive". In that case, we could avoid multiple ifs and use a jump table. On Neon, this could be pretty efficient as there are instructions for the sequences "load scalar-insert" and "extract-store scalar", and all the instructions have the same size. We could thus have a computed goto with minimal size (<~20 instructions for 8-bit elements, much less for 32-bit elements).
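
As an illustration of this "special mask shape" fast path, here is a hedged scalar C++ sketch of storeOnlyN for a 4x32-bit vector, where the switch plays the role of the jump table (the name and signature are made up):

#include <cstddef>
#include <cstdint>

// Store only the first n lanes; the fallthrough chain stores lanes n-1 down to 0.
void store_only_n_i32x4(int32_t* dst, const int32_t* src, size_t n) {
    switch (n) {
        case 4: dst[3] = src[3]; [[fallthrough]];
        case 3: dst[2] = src[2]; [[fallthrough]];
        case 2: dst[1] = src[1]; [[fallthrough]];
        case 1: dst[0] = src[0]; [[fallthrough]];
        default: break; // n == 0: nothing to store
    }
}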

Such a loadOnlyN or storeOnlyN could either be an explicit WASM instruction, or just an optimization that the engine could do when it recognizes the shape of the mask (with some "metadata folding" on masks).

Extra considerations

Maybe we want only some of these operations, for example just the aligned ones (I don't recommend it, but it is possible).

We could also say that the program may fault if the load crosses a page boundary even if the fault occurs only on inactive elements. In that case, it would be the responsibility of the end-user to ensure the 2 pages are actually allocated. I also don't recommend it, but you might have some ideas to make it work without much complexity on the user side.

lars-t-hansen commented 4 years ago

About read-modify-write for masked store: as the wasm memory model is still a work in progress (hi @conrad-watt, @rossberg) it's probably premature to say what will be allowed. The EcmaScript memory model explicitly disallows this as an implementation technique for its current data types, though it does allow floating point loads and stores to tear at byte boundaries. EcmaScript is relevant because we want the wasm and EcmaScript memory models to be a single model, if possible; on the other hand, EcmaScript is unlikely ever to have SIMD data, so there's some freedom here.

As we're unlikely ever to have atomic SIMD data of any size it seems that we have some leeway here to choose a model that works for good implementations, though it probably must be a good fit for the existing memory model's formal framework.

conrad-watt commented 4 years ago

Trying to get my head around this has really sent me down a rabbit hole.

~It seems that the fixed-size SIMD proposal has some pretty robust alignment requirements.~ Is there an example you can show me to help me understand why unaligned accesses are necessary here?

Is it possible we would want to vectorise scalar Wasm code as a Wasm->Wasm transformation (e.g. in wasm-opt), using one of these operations? Allowing load-blend-store might make this harder (since the scalar Wasm wouldn't have the same race condition).

lemaitre commented 4 years ago

@lars-t-hansen

The EcmaScript memory model explicitly disallows this as an implementation technique for its current data types, though it does allow floating point loads and stores to tear at byte boundaries.

I think tearing at byte (or element) boundary would be ok. In fact, I don't think (masked) accesses on AVX512 or SVE are atomic, so tearing might be unavoidable here.

However, I don't get your point about EcmaScript memory model. Does EcmaScript have any sort of partial/masked memory accesses? If not, why do you say that it explicitly disallows this?

Anyway, the "load-blend-store" is one way to emulate the masked store, but not the only one. The scalar emulation does not have this race-condition problem, for instance. The question is then: do we prefer fast masked stores on architectures without native support (e.g. Neon), or do we prefer race-condition freedom?

I would personally prefer race-condition freedom.

@conrad-watt

It seems that the fixed-size SIMD proposal has some pretty robust alignment requirements. Is there an example you can show me to help me understand why unaligned accesses are necessary here?

As far as I understand, fixed-width SIMD in WASM has pretty loose alignment requirements (alignment is just a hint). But the thing is, with flexible vectors, you don't know in advance the architectural alignment of SIMD registers, so you will need unaligned memory accesses more than with fixed-width SIMD. And the larger the vector, the more likely it is that addresses will not be aligned in practice. This makes unaligned accesses more important with flexible vectors than with fixed-width SIMD.

Now, we still might want to forbid unaligned masked accesses and keep only unmasked unaligned and masked aligned memory accesses to simplify implementations. I don't really have any example where it is useful, though.

Is it possible we would want to vectorise scalar Wasm code as a Wasm->Wasm transformation (e.g. in wasm-opt), using one of these operations? Allowing load-blend-store might make this harder (since the scalar Wasm wouldn't have the same race condition).

I'm not sure we would want a WASM->WASM vectorization, but a C->WASM vectorization would be interesting. It would probably be as hard to detect race conditions in C as in WASM, so the issue remains.

This argument is the one that makes me prefer scalar emulation over "load-blend-store" when there is no hardware support.

lemaitre commented 4 years ago

I just found this paper (http://www.cosenza.eu/papers/PohlSCOPES18.pdf). They explore a 3rd way to emulate masked stores on ARM: a "load acquire-blend-store conditional" loop. This does not appear to be faster than a scalar-emulated masked store, but it is interesting nonetheless.

They also talk about unaligned masked loads, making this paper even more relevant for this discussion.

conrad-watt commented 4 years ago

As far as I understand, fixed-width SIMD in WASM has pretty loose alignment requirements (alignment is just an hint).

Ah, you're right. I misunderstood.

I'm not sure we would want a WASM->WASM vectorization, but a C->WASM vectorization would be interesting. It would probably be as hard to detect race conditions in C than in WASM, so the issue remains.

I'm still trying to catch up on how vectorisation works, but could something analogous to the following be a real optimisation?

Thread 1
// x is a 32-bit int array
x[0] = 3
x[1] = 5
x[3] = 7

Gets transformed/compiled into

Thread 1
masked store the vector <3,5,?,7> into x[0-3], using mask <1,1,0,1>

lars-t-hansen commented 4 years ago

However, I don't get your point about EcmaScript memory model. Does EcmaScript have any sort of partial/masked memory accesses?

Nothing that's masked or partial per se. There are issues around architectures that don't have byte-granularity accesses; for example, the first Alpha had to do byte accesses as rmw.

If not, why do you say that it explicitly disallows this?

I'm quoting the spec, specifically the third bullet of note 2 here: https://tc39.es/ecma262/#sec-shared-memory-guidelines. I don't remember the details - Conrad might - but this may just be a consequence of the details in the model that allow the race semantics to work portably.

lemaitre commented 4 years ago

@conrad-watt Your example could work for fixed-size SIMD, but with flexible vectors, as you don't know the size of the vector, such transformations would be hard to do automatically in practice.

A better example would be:

for (int i = 0; i < n; ++i) {
  auto a = A[i];
  if (a > 0) {
    B[i] = a;
  }
}

Autovectorization is primarily for loops (even more so with flexible vectors).
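
For concreteness, a hedged sketch of how that loop could be vectorized with a masked store, written with fixed-width AVX-512 intrinsics just to show the pattern (a flexible-vector version would use the runtime vector length instead of 16):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Copy only the positive elements of A into B at the same indices; the inactive
// lanes of B must stay untouched, hence the masked store.
void copy_positive(int32_t* B, const int32_t* A, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i   a = _mm512_loadu_si512(A + i);                          // load 16 elements
        __mmask16 m = _mm512_cmpgt_epi32_mask(a, _mm512_setzero_si512()); // mask of a > 0
        _mm512_mask_storeu_epi32(B + i, m, a);                            // store active lanes only
    }
    for (; i < n; ++i) {                                                  // scalar tail
        if (A[i] > 0) B[i] = A[i];
    }
}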

@lars-t-hansen

There are issues around architectures that don't have byte-granularity accesses; for example, the first Alpha had to do byte accesses as rmw.

That makes sense, thanks for the clarification. In that case, yes, the byte access can be viewed as a masked word access.

Therefore, I would recommend forbidding the "load-blend-store" sequence for emulating a masked store, in order not to introduce data races. We could still allow it if the WASM engine can prove it does not introduce a data race (for example, if the engine is single-threaded).

conrad-watt commented 4 years ago

@lemaitre does my example below make sense? I believe it supports forbidding the load-blend-store implementation (for shared memories).

Consider the following code where k is a global variable with some fixed (but not statically known) value:

Thread 1:
for (int i = 0; i < n; ++i) {
  if (i != k) {
    B[i] = A[i];
  }
}
Thread 2:
B[k] = 1;

This is a fine C program. The C memory model guarantees that if a subsequent thread synchronizes with both thread 1 and thread 2, then it will observe B[k] = 1. However, if this program is compiled to Wasm and Thread 1 is optimised using a masked vector store, then, if the load-blend-store approach is allowed, it is possible for the subsequent thread to read B[k] = <initial value> instead.

This is the concretisation of @lemaitre's "data-race" concern - but even without having to classify data-races, we can see that the compilation scheme isn't sound if load-blend-store is allowed.

I think there's a similar logic in forbidding this implementation technique in the JS spec. In the counter-example above, if Thread 2's B[k] = 1 was implemented as a larger (non-atomic) RMW, then you could read B[k+1] in the subsequent synchronized thread and get a value that isn't allowed by C.

jan-wassenberg commented 4 years ago

@lemaitre:

I would personally prefer race-condition freedom.

Does x86 guarantee its masked load/store instructions are race-condition free?

@conrad-watt:

This is a great example of why I believe masked loads to be a dangerous/misleading abstraction: they attempt to promise something that isn't actually efficient for current hardware.

If we instead have applications write out the load-blend-store manually, we see that they are writing the entire memory range, and thus thread 2 has a data race. Tools such as tsan will detect this.

If we rely on the fact that single-element stores are atomic (which is not even guaranteed on all platforms, especially if unaligned), and then provide masked_store() = loop of conditional scalar stores, I do not see what benefit that brings vs. users just writing that loop directly. It would at least make more visible how expensive this is going to be.

lemaitre commented 4 years ago

@conrad-watt Yes, this is one example that would show the problem. Another one could be:

// Thread 1: i0 = 0, i1 = 100
// Thread 2: i0 = 100, i1 = 200
for (int i = i0; i < i1; ++i) {
  A[i] = f(i);
}

@jan-wassenberg

Does x86 guarantee its masked load/store instructions are race-condition free?

That's a good question. The closest I could find is Intel pseudocode where the store is per element within an if, or where it explicitly says *DEST[i+31:i] remains unchanged*:

VMASKMOVPS - 128-bit store
IF (SRC1[31]) DEST[31:0] := SRC2[31:0]
IF (SRC1[63]) DEST[63:32] := SRC2[63:32]
IF (SRC1[95]) DEST[95:64] := SRC2[95:64]
IF (SRC1[127]) DEST[127:96] := SRC2[127:96]
VMOVAPS (EVEX encoded versions, store-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
  i := j * 32
  IF k1[j] OR *no writemask*
    THEN DEST[i+31:i] := SRC[i+31:i]
    ELSE *DEST[i+31:i] remains unchanged*
  FI;
ENDFOR;

But I'm not able to find anything more than that.

This is a great example of why I believe masked loads to be a dangerous/misleading abstraction: they attempt to promise something that isn't actually efficient for current hardware.

Just a reminder: the goal of this issue is not to decide whether we want them, but, assuming we want them, to determine what we want from them and how to implement them efficiently. The reason is that to choose whether we want them, it is good to know if they can be made efficient.

So I would prefer that you explain why they cannot be made efficient on most architectures, and then we can see whether that can be worked around, instead of just saying that they are not efficient and should be avoided.

If we rely on the fact that single-element stores are atomic

We do not need element atomicity, just race-condition freedom on inactive elements. In other words, we want inactive elements not to introduce any race conditions. But if there are race conditions on active elements, atomicity is not guaranteed.

I do not see what benefit that brings vs. users just writing that loop directly. It would at least make more visible how expensive this is going to be.

If we provide a WASM instruction, the scalar emulation will be there only if there is no hardware support. If we let users (or compilers) generate this scalar emulation, there is no way the WASM engine will be able to generate the native masked instruction.

I would also like to point out that SSE has a masked store instruction (_mm_maskmoveu_si128). So the question is mostly about Neon (and Altivec?).

jan-wassenberg commented 4 years ago

But I'm not able to find anything more than that.

Yes, it is unfortunately not made explicit. I would consider it risky to assert any claim of race-freedom (even for inactive lanes) on x86.

lemaitre commented 4 years ago

Yes, it is unfortunately not made explicit. I would consider it risky to assert any claim of race-freedom (even for inactive lanes) on x86.

As a core has to get exclusive access to the cache line in order to write to it, we are sure that there can be no race condition on inactive elements between cores. In a hyperthreaded context, it is a bit more blurry. According to Agner Fog, the latency of most masked stores is about a dozen cycles (the throughput is about 1 instruction/cycle), which means they are most likely micro-coded. If that is actually correct, then the stores would be physically predicated, and inactive lanes would not be touched at all (ensuring race-condition freedom on inactive elements). The high-ish latency is also a hint that the access has byte granularity. This would mean that such a store is not atomic, even per lane, and tearing might occur even within individual lanes. I consider that the lack of per-element atomicity is not a problem.

lemaitre commented 4 years ago

I finally took the time to test the actual behavior of masked store instructions (check out the gist).

I tested 4 ways to "masked store" 4x32 bits: scalar, SSE2, AVX2, and AVX512.

All the threads access the same memory location, but their masks are non-conflicting (in practice, only the first 4 threads have non-zero masks).

All the variants give correct results on Skylake-X and Haswell, even with hyper-threading. Slight caveat: the SSE2 masked store segfaults if the location is not aligned, despite the documentation.
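
For reference, here is a hedged reconstruction of the kind of test described (this is not the author's gist): four threads repeatedly masked-store into the same 4x32-bit location with non-conflicting single-lane masks; if inactive lanes are never written, each lane ends up holding its own thread's value.

#include <immintrin.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    alignas(32) static int shared[8] = {0}; // only the first 4 lanes are targeted
    std::vector<std::thread> threads;
    for (int lane = 0; lane < 4; ++lane) {
        threads.emplace_back([lane] {
            // Mask active only in this thread's lane (all bits set in that 32-bit lane).
            __m256i mask  = _mm256_setr_epi32(lane == 0 ? -1 : 0, lane == 1 ? -1 : 0,
                                              lane == 2 ? -1 : 0, lane == 3 ? -1 : 0,
                                              0, 0, 0, 0);
            __m256i value = _mm256_set1_epi32(lane + 1);
            for (int iter = 0; iter < 1000000; ++iter)
                _mm256_maskstore_epi32(shared, mask, value); // write this thread's lane only
        });
    }
    for (auto& t : threads) t.join();
    // Expected "1 2 3 4" if inactive lanes were indeed left untouched.
    std::printf("%d %d %d %d\n", shared[0], shared[1], shared[2], shared[3]);
}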

I also measured the speed of all these variants:

| variant | platform | 1C time (cycles) |
|---------|----------|------------------|
| scalar  | SKX      | 17.6  |
| SSE2    | SKX      | 801.5 |
| AVX2    | SKX      | 18.2  |
| AVX512  | SKX      | 17.7  |
| scalar  | HSW      | 17.0  |
| SSE2    | HSW      | 404.0 |
| AVX2    | HSW      | 17.0  |

In light of the measurements, scalar emulation is inexpensive and can actually be used. The masked store from SSE2 should be avoided at all costs, but at least it gives correct results.

It would be interesting to see how the results change with smaller types (byte accesses) and longer vectors (256 and 512 bits). Some preliminary results seem to show that the time for the vector variants does not increase with the length of the vector.

lemaitre commented 4 years ago

I just measured both the throughput and the latency of all the possible combinations for masked stores. The numbers in the previous post were latencies.

The columns indicate the element size and the number of elements within a vector. There are 5 versions: st, scalar, SSE, AVX2, and AVX512 (see the tables below).

Throughput⁻¹ (cycles):

| arch | version | 8x16 | 16x8 | 32x4 | 64x2 | 8x32 | 16x16 | 32x8 | 64x4 | 8x64 | 16x32 | 32x16 | 64x8 |
|------|---------|------|------|------|------|------|-------|------|------|------|-------|-------|------|
| HSW | st | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | | | | |
| HSW | scalar | 32.00 | 16.00 | 8.00 | 4.50 | 66.00 | 36.00 | 16.00 | 8.00 | | | | |
| HSW | SSE | 10.00 | 10.00 | 10.00 | 10.00 | 15.00 | 15.00 | 15.00 | 15.00 | | | | |
| HSW | AVX2 | | | 1.60 | 1.60 | | | 1.50 | 1.60 | | | | |
| SKX | st | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SKX | scalar | 32.20 | 16.20 | 8.00 | 4.50 | 103.00 | 35.50 | 16.25 | 8.00 | 203.00 | 103.40 | 45.10 | 18.40 |
| SKX | SSE | 9.00 | 9.00 | 9.00 | 9.00 | 15.00 | 15.00 | 15.00 | 15.00 | 29.15 | 29.15 | 29.15 | 29.15 |
| SKX | AVX2 | | | 1.25 | 1.25 | | | 1.25 | 1.25 | | | 2.50 | 2.50 |
| SKX | AVX512 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | 1.00 | 1.00 | 1.00 |

Latencies (cycles):

| arch | version | 8x16 | 16x8 | 32x4 | 64x2 | 8x32 | 16x16 | 32x8 | 64x4 | 8x64 | 16x32 | 32x16 | 64x8 |
|------|---------|------|------|------|------|------|-------|------|------|------|-------|-------|------|
| HSW | st | 6.33 | 6.33 | 6.33 | 6.33 | 7.00 | 7.00 | 7.00 | 7.00 | | | | |
| HSW | scalar | 42.00 | 28.00 | 20.00 | 18.00 | 74.00 | 43.00 | 28.00 | 24.00 | | | | |
| HSW | SSE | ~400 | ~400 | ~400 | ~400 | ~400 | ~400 | ~400 | ~400 | | | | |
| HSW | AVX2 | | | 16.00 | 16.00 | | | 17.00 | 17.00 | | | | |
| SKX | st | 5.00 | 5.00 | 5.00 | 5.00 | 6.50 | 6.50 | 6.50 | 6.50 | 6.35 | 6.35 | 6.35 | 6.35 |
| SKX | scalar | 42.00 | 30.44 | 22.30 | 19.30 | 124.60 | 43.75 | 32.00 | 24.50 | 228.30 | 125.30 | 51.11 | 34.68 |
| SKX | SSE | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 | ~600 |
| SKX | AVX2 | | | 15.00 | 15.00 | | | 16.00 | 16.00 | | | 16.00 | 16.00 |
| SKX | AVX512 | 15.00 | 15.00 | 15.00 | 15.00 | 16.00 | 16.00 | 16.00 | 16.00 | 15.50 | 16.00 | 16.00 | 16.00 |

First thing: _mm_maskmoveu_si128 is a bit special: it has a non-temporal hint (it also requires alignment). This means that the store will bypass the caches entirely, hence the very high latency. Throughput-wise, it is actually decent and could be used for loop remainders (as the end of the dataset will likely not be accessed just after).

Then, scalar is good only when the number of elements is low (either the SIMD width is small or the type is large). This would be enough to emulate masked stores on SSE hardware, though.

AVX2 and AVX512 are really good, especially their throughput. Latencies are high-ish, but during my experiments, they sometimes dropped to ~5-6 cycles (depending on the mask).


In light of those measurements, masked stores for SSE are OK (with scalar emulation). Masked stores for AVX2 are good for 32-bit and 64-bit types, but scalar emulation for smaller types will incur a high penalty. Masked stores for AVX512 are good, period.

The only "issue" would be 8-bit and 16-bit integer masked store on AVX2 which would be slow. We could then provide all masked stores and have a warning about the speed for small types on old hardware.

Also, for a loop epilog, the same mask can be reused for multiple masked stores; thus, it is not necessary to duplicate the "lane branching" logic. But in the case of a loop epilog, a storeOnlyN would be even more efficient as it can just be a computed goto.

Therefore, I think it would be good to also standardize storeOnlyN, maybe with 2 versions, storeOnlyFirstN and storeOnlyLastN.


What do you think?