Closed: hokkanen closed this issue 1 year ago
In the reproducer, on the last block of the launch, block_reduce
won't be called by all threads in the block.
rocPRIM's block collective primitives must be called in block-uniform control flow; otherwise they might hang (the block algorithms are allowed to contain __syncthreads) or produce incorrect results.
This applies to CUB as well, so the reproducer probably just happens to work with CUB (it would likely fail with a larger block size).
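For reference, the problematic pattern is presumably along these lines (a sketch reconstructed from the description above, reusing the names from the snippet below, not the reproducer's exact code): the block reduction sits inside the bounds check, so the trailing threads of the last block never reach it.

// Local storage for the thread summation value
int thread_sum = 0;

// BUG (assumed shape of the original code): everything, including the
// block-wide reduction, runs inside the bounds check, so on the last
// block not all threads call the reduce and it may hang or misbehave
if (idx < loop_size) {
  loop_body(idx, thread_sum);

#if USE_HIPCUB == 1
  int aggregate = BlockReduce(temp_storage).Sum(thread_sum);
#else
  int aggregate;
  block_reduce_int().reduce(thread_sum, aggregate, temp_storage, rocprim::plus<int>());
#endif

  if (threadIdx.x == 0)
    atomicAdd(sum, aggregate);
}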
Can you try with the following modification?
// Local storage for the thread summation value
int thread_sum = 0;

// Check loop limits
if (idx < loop_size) {
  // Evaluate the loop body, the summation value is stored in thread_sum
  loop_body(idx, thread_sum);
}

// Compute the block-wide sum (aggregate) for the first thread of each block
#if USE_HIPCUB == 1
int aggregate = BlockReduce(temp_storage).Sum(thread_sum);
#else
int aggregate;
block_reduce_int().reduce(thread_sum, aggregate, temp_storage, rocprim::plus<int>());
#endif

// The first thread of each block stores the block-wide aggregate to 'sum' using atomics
if (threadIdx.x == 0)
  atomicAdd(sum, aggregate);
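For context, a self-contained kernel built around this fix might look roughly as follows. This is only a sketch assuming hipCUB and a compile-time block size BLOCKSIZE; the kernel name, the data pointer that stands in for loop_body, and the launch parameters are illustrative and not taken from the reproducer.

#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>

constexpr int BLOCKSIZE = 64;

__global__ void reduction_kernel(const int *data, int loop_size, int *sum)
{
  // Block-wide sum over int with a compile-time block size
  using BlockReduce = hipcub::BlockReduce<int, BLOCKSIZE>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  const int idx = blockIdx.x * blockDim.x + threadIdx.x;

  // Local storage for the thread summation value (data[idx] stands in for loop_body)
  int thread_sum = 0;
  if (idx < loop_size)
    thread_sum = data[idx];

  // Every thread of the block participates in the reduction
  const int aggregate = BlockReduce(temp_storage).Sum(thread_sum);

  // The first thread of each block accumulates the block-wide sum
  if (threadIdx.x == 0)
    atomicAdd(sum, aggregate);
}

Launched with (loop_size + BLOCKSIZE - 1) / BLOCKSIZE blocks of BLOCKSIZE threads, every thread of every block reaches the Sum() call even when the last block is only partially covered by loop_size.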
Thanks a lot, this indeed fixed the issue.
Describe the bug
My simple reduction kernel that uses block_reduce_int() produces incorrect results with rocPRIM release branches 4.4+. Release branch 4.3 works fine. The same code compiles and runs fine with HIP on V100- and A100-based NVIDIA systems (the CSC Puhti and Mahti supercomputers). I'm not 100% positive it is a rocPRIM-side issue, but it looks like it.

To Reproduce
Compiling and running https://github.com/hokkanen/rocprim_issue/blob/master/hipcub_demo.cpp with rocPRIM release branches 4.4+ (see lines 14 and 17 to include the rocPRIM header) on the Lumi supercomputer produces the issue.
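For what it's worth, block_reduce_int is presumably an alias for the rocPRIM block reduce primitive, selectable against hipCUB roughly like this (a sketch based on the snippets in this thread; BLOCKSIZE and the alias names are assumptions, not the reproducer's exact code):

constexpr int BLOCKSIZE = 64; // assumed compile-time block size

#if USE_HIPCUB == 1
#include <hipcub/hipcub.hpp>
// hipCUB path: block-wide reduction over int with a compile-time block size
using BlockReduce = hipcub::BlockReduce<int, BLOCKSIZE>;
#else
#include <rocprim/rocprim.hpp>
// rocPRIM path: the equivalent block reduce primitive
using block_reduce_int = rocprim::block_reduce<int, BLOCKSIZE>;
#endif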
Expected behavior
I expect to see the following:
Instead, I see this:
Environment
environment.txt