Multidimensional shuffles do not map to OpenCL subgroups

CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

Multidimensional shuffles do not map to OpenCL subgroups #219

Open pjaaskel opened 1 year ago

pjaaskel commented 1 year ago

For example, the 2d_shuffle test case assumes that the warp (subgroup) width is at least 16 and that threads map to lanes in linear order, so that a full matrix transpose can be done with a warp shuffle. The OpenCL subgroup feature fixes neither the lane mapping nor the subgroup width. We can use the required subgroup size extension to fix the subgroup width to 32 to emulate warps, but the thread-to-lane mapping is still implementation specific. For this I've started drafting another extension.
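To make the assumption concrete, here is a host-side C++ sketch. It is not the actual 2d_shuffle test (its source is not shown in this thread); it assumes a hypothetical 4x4 thread block, models a shuffle as a plain read from a per-lane register array, and uses made-up names (`row_major_lane`, `col_major_lane`, `transpose_is_correct`). The kernel logic always computes the source lane as if the mapping were linear; only the mapping the "implementation" actually uses varies.

```cpp
#include <array>

constexpr int kDim = 4;              // hypothetical 4x4 thread block
constexpr int kLanes = kDim * kDim;  // 16 lanes: the warp must be at least this wide

using LaneMap = int (*)(int x, int y);

// Row-major (linear) thread-to-lane mapping, as CUDA/HIP code expects.
int row_major_lane(int x, int y) { return x + y * kDim; }

// One possible non-linear mapping an implementation could pick instead.
int col_major_lane(int x, int y) { return y + x * kDim; }

// Returns true if a shuffle-based transpose yields the right values when
// the implementation places thread (x, y) in lane actual(x, y).
bool transpose_is_correct(LaneMap actual) {
    // Each thread holds its own row-major element index in its lane's register.
    std::array<int, kLanes> regs{};
    for (int y = 0; y < kDim; ++y)
        for (int x = 0; x < kDim; ++x)
            regs[actual(x, y)] = x + y * kDim;

    for (int y = 0; y < kDim; ++y)
        for (int x = 0; x < kDim; ++x) {
            // The kernel computes the source lane assuming the linear mapping:
            // it wants the value held by the thread at coordinates (y, x).
            int src_lane = row_major_lane(y, x);
            int got = regs[src_lane];  // simulated shuffle read
            int want = y + x * kDim;   // value held by thread (y, x)
            if (got != want) return false;
        }
    return true;
}
```

In this model `transpose_is_correct(row_major_lane)` holds, while `transpose_is_correct(col_major_lane)` does not: with the column-major placement every thread reads back its own value instead of its transpose partner's.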

Luckily, the Intel GPU drivers already seem to implement the linear mapping, at least when shuffles are detected (the 2d_shuffle test case works, at least), but this is not documented behavior. The Intel CPU driver sets the subgroup width according to the innermost loop (the vectorized dimension), so exchanging data across multiple dimensions doesn't work, as the subgroup processes one "row" at a time (#142).
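The row-at-a-time behavior can be illustrated with a small C++ model. This is a simplification of what is described above, not something derived from the driver: it assumes a hypothetical 4x4 work-group in which each subgroup holds exactly one row, and the function names are made up for illustration.

```cpp
constexpr int kTile = 4;  // hypothetical 4x4 work-group

// Simplified model: each subgroup holds exactly one row of the work-group,
// so the subgroup index is just the y coordinate.
int rowwise_subgroup(int /*x*/, int y) { return y; }

// Counts work-items whose transpose partner (y, x) sits in the same subgroup
// and is therefore reachable with a subgroup shuffle.
int reachable_transpose_partners() {
    int count = 0;
    for (int y = 0; y < kTile; ++y)
        for (int x = 0; x < kTile; ++x)
            if (rowwise_subgroup(x, y) == rowwise_subgroup(y, x))
                ++count;
    return count;
}
```

Under this model only the 4 diagonal work-items out of 16 can reach their partner, which is why a cross-dimension exchange cannot be expressed as a subgroup shuffle there.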

pvelesko commented 1 year ago

> but the thread to lane mapping is still implementation specific

Any reason why you would want a mapping that's not linear?

Since Intel developed this extension and only Intel uses it, was there a reason for not specifying that the mapping must be linear?

@pjaaskel @bashbaug

pvelesko commented 1 year ago

> Intel CPU driver sets the subgroup width according to the innermost loop (the vectorized dimension)

Does the CPU driver not enforce cl_intel_required_subgroup_size? @pjaaskel

pjaaskel commented 1 year ago

@pvelesko it should, but it does not (yet) guarantee linear ids or multidimensional subgroups. OpenCL doesn't even support multidimensional subgroups. In fact, it explicitly states that subgroups are one-dimensional. Luckily, the Intel iGPUs seem to implement multidimensional subgroups with the required subgroup size extension, so it works by luck. I believe we need to extend the OpenCL spec to allow multidimensional subgroups. Perhaps this could be included in cl_intel_required_subgroup_size: when a single dimension is not enough to fill the subgroup, include WIs from multiple dimensions in it (preferably with a linear mapping).

pjaaskel commented 1 year ago

I misinterpreted the sentence in the OpenCL spec, "While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of.", to mean that a subgroup can get WIs only from one row, column or z-dimension at a time. I was corrected that this is not the intention of the sentence: it only refers to the indexing of the WIs within the subgroup, for which there is only the flattened get_sub_group_local_id() function. That is, the implementation can map multiple WG "rows" to a subgroup to fill up the lanes. This means that we only need to ensure that the mapping is row-major like in CUDA, which is what the above-mentioned pending extension is for.
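As a sketch of what such a row-major mapping would look like, the C++ helpers below flatten a 3D local id and split it into a subgroup index and a lane. The flattening uses the same formula as OpenCL's get_local_linear_id(). The packing of consecutive linear ids into a subgroup is the CUDA-like behavior we would want, assumed here and not mandated by the current OpenCL spec. `SubgroupSlot`, `local_linear_id` and `linear_slot` are hypothetical names.

```cpp
struct SubgroupSlot {
    int subgroup_id;  // which subgroup within the work-group
    int lane;         // position inside that subgroup
};

// Same formula as OpenCL's get_local_linear_id(): x varies fastest.
int local_linear_id(int x, int y, int z, int size_x, int size_y) {
    return z * size_y * size_x + y * size_x + x;
}

// Assumed (not spec-mandated) packing: consecutive linear ids fill one
// subgroup before the next one starts, so rows can share a subgroup.
SubgroupSlot linear_slot(int linear_id, int subgroup_width) {
    return {linear_id / subgroup_width, linear_id % subgroup_width};
}
```

For an 8x8 work-group and a subgroup width of 32, work-item (7, 3, 0) gets linear id 31 (subgroup 0, lane 31) and work-item (0, 4, 0) gets linear id 32 (subgroup 1, lane 0), so subgroup 0 spans rows 0 to 3.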

pvelesko commented 1 year ago

still open?

pjaaskel commented 1 year ago

Not sure if we should close this yet. Although this is not a CHIP-SPV issue, the problem on the OpenCL side is not resolved until there is an extension mandating the linear id. That is being discussed and drafted. It might be good to keep this open to track progress and link to any developments. After all, the warp-level primitives are essential for correct execution of CUDA/HIP code.