This is the beginning of the code:
template <typename T, int W>
__global__ void testkernel(int N, int k, const T* in, float* out)
{
int idx = threadIdx.y + blockDim.y*(blockIdx.x + 1024*blockIdx.y);
float2 sum = {0,0};
assert(!(blockDim.x & 31));
__shared__ float2 sums[W];
...
And this is the beginning of the ISA:
; %bb.0: ; %entry
s_load_dwordx2 s[2:3], s[4:5], 0x4
s_load_dword s12, s[4:5], 0x4
s_load_dwordx2 s[0:1], s[4:5], 0xc
s_waitcnt lgkmcnt(0)
s_lshr_b32 s2, s2, 16 // s2 >>= 16
s_mul_i32 s2, s2, s3 // s2 *= s3
v_mul_lo_u32 v3, s2, v0 // v3 = v0*(s2>>16) * s3
v_mul_lo_u32 v4, v1, s3 // v4 = v1*s3
s_and_b32 s2, s12, 0xffff // s2 = s12 & 0xffff
s_mul_i32 s3, s10, s2 // s3 = s10*(blockDim.x & 0xffff)
s_sub_i32 s0, s0, s3 // s0 -= s3
s_min_u32 s12, s0, s2 // s12 = min(s0,(blockDim.x & 0xffff))
v_add_u32_e32 v3, v3, v4 // v3 += v4
s_and_b32 s0, s12, 31
v_add_lshl_u32 v2, v3, v2, 3 // v2 = (v2+v3)*8
v_mov_b32_e32 v3, 0
v_mov_b32_e32 v4, v3 // v3 = v4 = 0
s_cmp_eq_u32 s0, 0
ds_write_b64 v2, v[3:4] // *v2 = [0, 0]
s_cbranch_scc0 BB1_13 // assert((s12 & 31) == 0); s12 is blockDim.x
; %bb.1: ; %if.end
It looks like the kernel is trying to use those extra 2048 bytes to hold the 'sum' variable, which it can't do, since there are 1024 threads and each needs 8 bytes. Something is fishy here.
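To put numbers on that (my arithmetic, using the sizes quoted elsewhere in this thread):

needed to spill 'sum': 1024 threads * sizeof(float2) = 1024 * 8 = 8192 bytes per block
actually set aside:    2048 extra bytes of group segment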
I've verified that the kernel also fails with the "official" ROCm 2.6 install from http://repo.radeon.com/rocm/ on vanilla Ubuntu 18.04.3.
Going back through old releases, it fails with 2.0 (hcc 1.3.18482, hip_base 1.5.18494, hip_hcc 1.5.18494), but it compiles correctly if I roll back to hcc 1.2.18451, hip_base 1.5.18442, hip_hcc 1.5.18442.
[Talking to myself is very relaxing. No one there to argue with...]
The bug first shows up in commit fb5199d, dated July 16, 2018, after some header changes. However, it is not actually triggered by those changes; it is only exposed by them.
The fundamental problem is that the variable 'sum' is being spilled into shared memory, so the compiler needs to allocate some shared memory for it, and it appears to size that allocation under the assumption that the work group has at most 256 work items. (I think that number is coming from here: https://github.com/RadeonOpenCompute/llvm/blob/amd-common/lib/Target/AMDGPU/AMDGPUSubtarget.cpp#L349 )
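A 256-work-item assumption would also explain the exact sizes (again, my arithmetic):

256 work items * sizeof(float2) = 256 * 8 = 2048 bytes of spill space
2048 bytes + 32 float2 'sums' (32 * 8 = 256 bytes) = 2304 bytes of group segment, which is exactly what the broken build allocates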
The limitation of 256 work items / work group has been noted previously (see https://github.com/RadeonOpenCompute/ROCm/issues/330 and https://github.com/RadeonOpenCompute/ROCm/issues/58), and clinfo reports, confusingly,
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
and the problem goes away if I reduce the block size from 64x16 to 64x4.
But the kernel does launch successfully at 64x16.
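For what it's worth, the limits the HIP runtime itself advertises can be queried directly. A minimal sketch using the standard hipGetDeviceProperties API (device 0 assumed):

#include <cstdio>
#include <hip/hip_runtime.h>

int main()
{
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    // Limits the HIP runtime advertises for device 0; these need not
    // match the OpenCL numbers that clinfo prints above.
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}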
So, either memory allocation needs to be fixed, or the kernel needs to be prevented from launching in this scenario.
When a kernel is meant to be launched with more than 256 work-items per work-group, an attribute is required to inform the compiler that this is the intention. This attribute is mentioned here: https://clang.llvm.org/docs/AttributeReference.html#amdgpu-flat-work-group-size
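For example, applied to the kernel at the top of this issue it would look roughly like this (a sketch; the (1, 1024) bounds are my choice to cover a 64x16 launch):

template <typename T, int W>
__global__ void
__attribute__((amdgpu_flat_work_group_size(1, 1024)))   // allow 1 to 1024 work items per work group
testkernel(int N, int k, const T* in, float* out)
{
    // ... body unchanged ...
}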
Aha! Yes, adding the attribute does make the bug go away. However, it should either be applied automatically by HIP (>256 work items per workgroup is common in CUDA) or inserted by hipify whenever an invocation with >256 work items is detected.
And it should be mentioned explicitly in https://rocm-documentation.readthedocs.io/en/latest/Programming_Guides/HIP-porting-guide.html .
I've just grepped through the TensorFlow source and there isn't a single mention of amdgpu_flat_work_group_size to be found (and every single kernel is launched with 1024 work items).
Also, the documentation states that "this attribute ... is an optimization hint." By definition, optimization hints are optional. Either the LLVM docs must be changed to say that the attribute is mandatory with >256 work items, or the default value must be raised to 1024.
In the attached file, when compiled with current master HIP using
on Radeon RX Vega 64, the kernel (a simple piece of code that does reduction across a thread block using a combination of warp shuffle ops and shared memory) behaves differently depending on whether it is created with 32 float2's of shared memory or with 1024 float2's.
With 1024 float2's, it works correctly. With 32 float2's, the results are wrong and vary from run to run. That's despite the fact that the kernel never accesses any shared memory past the first 32 entries.
It looks like some sort of race condition / synchronization failure. I don't see anything wrong with my code and I suspect a compiler bug.
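For readers without the attachment, the reduction has roughly this shape (a generic sketch, not the attached kernel; the helper names and the warpSize-based indexing are mine):

#include <hip/hip_runtime.h>

// Reduce a float2 within one wavefront using shuffles.
__device__ float2 warp_reduce(float2 v)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        v.x += __shfl_down(v.x, offset);
        v.y += __shfl_down(v.y, offset);
    }
    return v;
}

// Reduce a float2 across the whole thread block; W must be at least
// the number of wavefronts per block.
template <int W>
__device__ float2 block_reduce(float2 v)
{
    __shared__ float2 partial[W];                       // one slot per wavefront
    int tid   = threadIdx.x + blockDim.x * threadIdx.y; // linear thread id
    int lane  = tid % warpSize;
    int wave  = tid / warpSize;
    int waves = (blockDim.x * blockDim.y) / warpSize;

    v = warp_reduce(v);
    if (lane == 0)
        partial[wave] = v;          // each wavefront publishes its partial sum
    __syncthreads();

    // The first wavefront reduces the per-wavefront partial sums.
    v = (tid < waves) ? partial[tid] : make_float2(0.f, 0.f);
    if (wave == 0)
        v = warp_reduce(v);
    return v;                       // fully reduced value in thread 0
}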
I tried to look at the ISA code and I see that it is substantially different between the two versions. The one that works correctly allocates 16 bytes per work item of "private segment" and 8192 bytes per block of "group segment" (shared memory?). The one that does not work correctly allocates 2304 bytes per block of "group segment", consisting of my 32 float2's and an additional 2048 bytes it uses for internal calculations. I can't quite follow the logic, but I don't see how it can use those 2048 bytes across 1024 threads without collisions.
vectoradd_hip.zip dump-gfx900.zip