KhronosGroup / GLSL

GLSL Shading Language Specification and Extensions

Possible error in GL_KHR_shader_subgroup.txt : Useless subgroupMemoryBarrier functions? #50

Open qnope opened 5 years ago

qnope commented 5 years ago

The function subgroupBarrier performs both an execution and a full memory barrier

The function subgroupBarrier() enforces that all active invocations within a subgroup must execute this function before any are allowed to continue their execution and the results of any memory stores performed using coherent variables performed prior to the call will be visible to any future coherent access to the same memory performed by any other shader invocation within the same subgroup.

I wonder if there is any use for the subgroupMemoryBarrier functions, since they do not perform an execution barrier:

The function subgroupMemoryBarrier() enforces the ordering of all memory transactions issued within a single shader invocation, as viewed by other invocations in the same subgroup.

However, it is written that the invocations within a subgroup run in parallel:

A subgroup is a set of invocations exposed as running concurrently with the current shader invocation. The number of invocations within a subgroup (the size of the subgroup) is a fixed property of the device.

Since they are running in parallel, is it useful to have an execution barrier? If it is not needed, the subgroupMemoryBarrier functions become useful, but the subgroupBarrier function becomes useless.

However, if we do need an execution barrier within a subgroup, I think it is necessary to synchronize other operations, such as shuffling, as well. That said, it is probably implicit, as with __shfl_down_sync in CUDA.

So,

jeffbolznv commented 5 years ago

The definitive answer for "what do subgroupBarrier and subgroupMemoryBarrier do" is to take the SPIR-V mappings from the GL_KHR_shader_subgroup extension and interpret them through the https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#memory-model appendix in the Vulkan spec. But suffice it to say that neither is "useless".

A subgroup is a set of invocations exposed as running concurrently with the current shader invocation. The number of invocations within a subgroup (the size of the subgroup) is a fixed property of the device.

Since they are running in parallel, is it useful to have an execution barrier?

The word "concurrently" here is vague and it should not be interpreted to mean they are running in lockstep. An implementation is free to let them run out of step with each other, and only sync them back up when required by barriers.

nhaehnle commented 5 years ago

To expand on Jeff's comment, consider an artificial example:

atomicStore(shared.x[tid], 1, gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);
atomicStore(shared.y[tid], 1, gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

ry = atomicLoad(shared.y[tid ^ 1], gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);
rx = atomicLoad(shared.x[tid ^ 1], gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

Assuming that shared.x and shared.y are all initialized to 0 earlier, what are the possible values of (rx, ry)? You might naively think the result must be (1, 1) as long as threads tid and tid ^ 1 are in the same subgroup, but in fact (0, 0), (0, 1), and (1, 0) are all possible outcomes as well!
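For intuition, here is a rough C++ analogue of this example; it is a sketch, not the GLSL semantics themselves. Two std::threads stand in for the two invocations, std::memory_order_relaxed stands in for gl_SemanticsRelaxed, and the names (run_once, Result) are made up for the illustration:

```cpp
#include <atomic>
#include <functional>
#include <thread>

struct Result { int rx0, ry0, rx1, ry1; };

// One run of the relaxed-atomics example. Nothing orders the two
// threads' accesses across locations, so each thread may observe
// any of (0,0), (0,1), (1,0), (1,1).
Result run_once() {
    std::atomic<int> x[2] = {0, 0};
    std::atomic<int> y[2] = {0, 0};
    Result r{};
    auto invocation = [&](int tid, int& rx, int& ry) {
        x[tid].store(1, std::memory_order_relaxed);
        y[tid].store(1, std::memory_order_relaxed);
        ry = y[tid ^ 1].load(std::memory_order_relaxed);
        rx = x[tid ^ 1].load(std::memory_order_relaxed);
    };
    std::thread t0(invocation, 0, std::ref(r.rx0), std::ref(r.ry0));
    std::thread t1(invocation, 1, std::ref(r.rx1), std::ref(r.ry1));
    t0.join();
    t1.join();
    return r;
}
```

Which of the four outcomes you actually see on a given run is timing- and hardware-dependent; the point is only that the relaxed model permits all of them.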

Now, let's say you add subgroupMemoryBarrier()s as follows:

atomicStore(shared.x[tid], 1, gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);
subgroupMemoryBarrier();
atomicStore(shared.y[tid], 1, gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

ry = atomicLoad(shared.y[tid ^ 1], gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);
subgroupMemoryBarrier();
rx = atomicLoad(shared.x[tid ^ 1], gl_ScopeSubgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

There is still no guaranteed ordering between the executions of different threads in the same subgroup, but the memory barriers introduce dependencies between accesses of shared.x and shared.y, with the result that only (0, 0), (1, 0), and (1, 1) are possible outcomes. Think of shared.y as a flag that indicates whether the value of shared.x is "ready", and the subgroup memory barrier ensures you get consistent results. ((1, 0) is still a possible outcome because you might read ry == 0 indicating that the result is not ready, but then the other thread updates shared.x before you read from it.)
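The C++ memory model has a direct analogue of this flag pattern, which may help make the guarantee concrete. In the sketch below (names are made up for the illustration), std::atomic_thread_fence sits where the GLSL uses subgroupMemoryBarrier(); a release fence on the producer side and an acquire fence on the consumer side are the weaker one-directional variants of the rel+acq barrier that subgroupMemoryBarrier() provides, and they are enough here: if the consumer reads ry == 1, it must also read rx == 1, so (0, 1) is ruled out.

```cpp
#include <atomic>
#include <utility>

std::atomic<int> gx{0}, gy{0};

// Mirrors: store x, subgroupMemoryBarrier(), store y.
void producer() {
    gx.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    gy.store(1, std::memory_order_relaxed);
}

// Mirrors: load y, subgroupMemoryBarrier(), load x.
// If the load of gy observes the producer's store, the fences
// synchronize, so the subsequent load of gx must observe 1.
std::pair<int, int> consumer() {
    int ry = gy.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    int rx = gx.load(std::memory_order_relaxed);
    return {rx, ry};  // ry == 1 implies rx == 1
}
```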

Note: The accesses to shared.x have to be atomic as long as the read from shared.x is unconditional, because otherwise there would be a data race in the ry == 0 case. If the load of shared.x was guarded by a check that ry == 1, then the accesses to shared.x would not have to be atomic.

So is this a concern in practice? Yes, it is: even an implementation where subgroups execute lockstep in parallel can have an optimizing compiler that may want to re-arrange those atomic memory accesses (hoisting loads is a good idea for memory latency hiding, for example...). The compiler can prove that the memory accesses are all to disjoint memory locations (from the perspective of a single thread), and since all the atomics are relaxed, reordering them is allowed. So get your memory barriers right -- and subgroupMemoryBarrier() is a reasonably weak memory barrier that still can have a useful effect, as shown (though it isn't the weakest possible memory barrier -- a release-only or acquire-only barrier is weaker and can be expressed using GL_KHR_memory_scope_semantics).

Finally, what the control barrier (subgroupBarrier()) gives you in this example is that if you place it between the stores and the loads, it restricts the possible outcomes to just (1, 1), since it guarantees that the stores of both threads execute before the loads of both threads. Using the subgroupBarrier() makes atomic loads and stores unnecessary. Note, however, that a pure control barrier (without memory semantics) is insufficient. You really need both the control and the memory barrier aspects to get the restriction (and subgroupBarrier() does have both).

If you want to explore this further, you can use the alloy-based tool in the Vulkan-MemoryModel repository. After setting it up according to the readme, create a file alloy/tests/experiment.test.gen and run make experiment.test.gen using the following base:

NEWWG
NEWSG
NEWTHREAD 0
st.atom.scopesg.sc0 x0 = 1
st.atom.scopesg.sc0 y0 = 1
ld.atom.scopesg.sc0 y1 = 0
ld.atom.scopesg.sc0 x1 = 1

NEWTHREAD 1
st.atom.scopesg.sc0 x1 = 1
st.atom.scopesg.sc0 y1 = 1
ld.atom.scopesg.sc0 y0 = 1
ld.atom.scopesg.sc0 x0 = 0

SATISFIABLE consistent[X]

and play around with changing the values being loaded. You'll see that basically anything you throw at it is consistent as far as the memory model is concerned. However, add appropriate memory barriers as follows (I've removed the memory accesses to shared.x[1] and shared.y[1] to make the test run faster):

NEWWG
NEWSG
NEWTHREAD 0
st.atom.scopesg.sc0 x0 = 1
membar.rel.scopesg.semsc0
st.atom.scopesg.sc0 y0 = 1

NEWTHREAD 1
ld.atom.scopesg.sc0 y0 = 1
membar.acq.scopesg.semsc0
ld.atom.scopesg.sc0 x0 = 0

NOSOLUTION consistent[X]

... and you'll see that (0, 1) is not a possible outcome. Note that I've used release-only and acquire-only barriers here instead of the stronger rel.acq equivalent of subgroupMemoryBarrier(), since those are in fact strong enough for this running example.