ROCm / rocPRIM

ROCm Parallel Primitives
https://rocm.docs.amd.com/projects/rocPRIM/
MIT License
162 stars 69 forks source link

rocPRIM reduction (block_reduce_int) issue #389

Closed hokkanen closed 1 year ago

hokkanen commented 1 year ago

Describe the bug My simple reduction kernel that uses block_reduce_int() produces incorrect results with rocPRIM release branches 4.4+. Release branch 4.3 works fine. The same code compiles and runs fine with HIP on V100 and A100 based NVIDIA systems (CSC Puhti and Mahti supercomputers). I'm not 100% positive it is rocPRIM side issue but looks like it.

To Reproduce Compiling and running https://github.com/hokkanen/rocprim_issue/blob/master/hipcub_demo.cpp with rocPRIM release branches 4.4+ (see lines 14 and 17 to include the rocPRIM header) on Lumi supercomputer produces the issue.

Expected behavior I expect to see the following:

The results calculated by GPU = 499500 and CPU = 499500 match!

Instead, I see this:

The results calculated by GPU = 460320 and CPU = 499500 do not match!

Environment environment.txt

Maetveis commented 1 year ago

In the reproducer, on the last block of the launch, block_reduce won't be called by all threads in the block.

rocPRIM's block collective primitives must be called in block-uniform control flow, otherwise they might hang (the block algorithms are allowed to contain __syncthreads) or otherwise produce incorrect results. This applies to CUB as well, so the reproducer probably just happens to work on CUB (will probably fail with a larger blocksize).

Can you try with the following modification?

// Local storage for the thread summation value
int thread_sum = 0;
// Check loop limits
if (idx < loop_size) {
  // Evaluate the loop body, the summation value is stored in thread_sum
  loop_body(idx, thread_sum);
}

// Compute the block-wide sum (aggregate) for the first thread of each block
#if USE_HIPCUB == 1
  int aggregate = BlockReduce(temp_storage).Sum(thread_sum);
#else
  int aggregate;
  block_reduce_int().reduce(thread_sum, aggregate, temp_storage, rocprim::plus<int>());
#endif

// The first thread of each block stores the block-wide aggregate to 'sum' using atomics
if(threadIdx.x == 0) 
  atomicAdd(sum, aggregate);
hokkanen commented 1 year ago

Thanks, this indeed fixed the issue. Thanks a lot.