Open pjaaskel opened 1 year ago
but the thread to lane mapping is still implementation specific
Any reason why you would want a mapping that's not linear?
Since Intel developed this extension and only Intel uses it, was there a reason for not specifying that the mapping must be linear?
@pjaaskel @bashbaug
Intel CPU driver sets the subgroup width according to the innermost loop (the vectorized dimension)
Does the CPU driver not enforce cl_intel_required_subgroup_size? @pjaaskel
@pvelesko it should, but it does not (yet) guarantee linear ids or multidimensional subgroups. OpenCL doesn't even support multidimensional subgroups; in fact, it explicitly states that subgroups are one-dimensional. Luckily, the Intel iGPUs seem to implement multidimensional subgroups with the required subgroup size extension, so it happens to work. I believe we need to extend the OpenCL spec to allow multidimensional subgroups, perhaps as part of cl_intel_required_subgroup_size: when a single dimension is not enough to fill the subgroup, include WIs from multiple dimensions in it (preferably with a linear mapping).
I had misinterpreted the sentence in the OpenCL spec, "While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of.", to mean that a subgroup can only take WIs from one row, column or z-dimension at a time. I was corrected that this is not the intention of the sentence: it only refers to how WIs are indexed within the subgroup, for which there is only the flattened get_sub_group_local_id() function. That is, the implementation can map multiple WG "rows" to a subgroup to fill up the lanes. This means that we only need to ensure the mapping is row-order as in CUDA, which is what the pending extension mentioned above is for.
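To make the row-order mapping concrete, here is a minimal host-side C++ sketch of how a CUDA-style linear mapping would assign work-items of a multi-dimensional work-group to subgroup lanes. The names `LaneCoord`, `linear_local_id`, `linear_lane` and `check_linear_lane_examples` are hypothetical helpers made up for illustration; they are not part of OpenCL, CHIP-SPV or any driver, and nothing here describes what a particular implementation actually does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: where a work-item ends up under a linear mapping.
struct LaneCoord {
    std::size_t subgroup; // index of the subgroup within the work-group
    std::size_t lane;     // lane index within that subgroup
};

// Row-order linearization with x as the fastest-varying dimension,
// which is how CUDA linearizes threadIdx within a block.
constexpr std::size_t linear_local_id(std::size_t x, std::size_t y, std::size_t z,
                                      std::size_t size_x, std::size_t size_y) {
    return (z * size_y + y) * size_x + x;
}

// Assumed mapping: consecutive linear ids fill consecutive lanes, so one
// subgroup can span several work-group "rows" when a row is narrower than it.
constexpr LaneCoord linear_lane(std::size_t x, std::size_t y, std::size_t z,
                                std::size_t size_x, std::size_t size_y,
                                std::size_t sg_width) {
    const std::size_t id = linear_local_id(x, y, z, size_x, size_y);
    return LaneCoord{id / sg_width, id % sg_width};
}

// Example: a 16x4 work-group with subgroup width 32.
inline void check_linear_lane_examples() {
    // Row 1 starts at lane 16 of subgroup 0, so rows 0 and 1 share a subgroup.
    assert(linear_lane(0, 1, 0, 16, 4, 32).subgroup == 0);
    assert(linear_lane(0, 1, 0, 16, 4, 32).lane == 16);
    // Row 2 starts the next subgroup.
    assert(linear_lane(0, 2, 0, 16, 4, 32).subgroup == 1);
    assert(linear_lane(0, 2, 0, 16, 4, 32).lane == 0);
}
```

With a 16x4 work-group and a subgroup width of 32, two rows land in each subgroup, which is the "multiple rows fill the lanes" case described above.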
still open?
Not sure if we should close this yet. Although this is not a CHIP-SPV issue, the problem on the OpenCL side will not be resolved until there's an extension mandating the linear id. That is being discussed and drafted. It might be good to keep this open to keep track of and link to any proceedings. After all, the warp-level primitives are essential for correct CUDA/HIP execution.
For example, the 2d_shuffle test case assumes warp (subgroup) width is at least 16 and that the lanes map the threads in a linear order to make a full matrix transpose using a warp shuffle. The lane mapping or the subgroup width is not fixed by the subgroup feature of OpenCL. We can use the required subgroup extension for fixing the subgroup width to 32 to emulate the warps, but the thread to lane mapping is still implementation specific. For this I've started drafting another extension.
Luckily, the Intel GPU drivers already implement the linear mapping at least when shuffles are detected (it seems -- the 2d_shuffle test case at least works), but it's not documented behavior. Intel CPU driver sets the subgroup width according to the innermost loop (the vectorized dimension), thus exchanging data across multiple dimensions doesn't work as the subgroup processes one "row" at a time (#142).
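The dependence on the lane mapping can be illustrated with a small host-side C++ simulation. This is not the actual 2d_shuffle test; it is a sketch that assumes a hypothetical 4x4 tile held by 16 lanes, with `shuffle` standing in for a warp/subgroup shuffle and all names invented for illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kTile = 4;              // hypothetical tile edge
constexpr std::size_t kLanes = kTile * kTile; // 16 lanes, one tile element each

// Host-side stand-in for a subgroup shuffle: every lane reads the value
// currently held by the lane named in src_lane[lane].
inline std::array<int, kLanes> shuffle(const std::array<int, kLanes>& values,
                                       const std::array<std::size_t, kLanes>& src_lane) {
    std::array<int, kLanes> out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        out[lane] = values[src_lane[lane]];
    }
    return out;
}

// Transpose that is only correct if thread (x, y) sits in lane y * kTile + x,
// i.e. if the thread-to-lane mapping is linear.
inline std::array<int, kLanes> transpose_via_shuffle(const std::array<int, kLanes>& values) {
    std::array<std::size_t, kLanes> src_lane{};
    for (std::size_t y = 0; y < kTile; ++y) {
        for (std::size_t x = 0; x < kTile; ++x) {
            src_lane[y * kTile + x] = x * kTile + y; // read from thread (y, x)
        }
    }
    return shuffle(values, src_lane);
}

// If each subgroup covered only one row, thread (x, y) would be in subgroup y
// while its source thread (y, x) would be in subgroup x, so the shuffle could
// only reach its source on the diagonal.
inline bool source_reachable_row_wise(std::size_t x, std::size_t y) {
    return x == y;
}
```

Under the linear mapping every lane finds its source inside the same subgroup. Under a one-row-per-subgroup mapping only the diagonal elements do, which is consistent with the cross-dimension exchange failing as described above.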