Closed rjfnobre closed 7 years ago
Hi. As of today, no. Currently, HIP supports only the programming APIs.
Thanks for the fast reply!
So how can I implement something like shown here? https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
If I convert the kernel (using hipify):

__inline__ __device__
int warpReduceSum(int val) {
  for (int offset = warpSize/2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);
  return val;
}
It converts to:
__inline__ __device__
int warpReduceSum(int val) {
  for (int offset = hipWarpSize/2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);
  return val;
}
I get errors with hipcc, saying that it does not recognize 'hipWarpSize'.
What is the correct translation between this piece of CUDA code and HIP code?
Hi, we are working on hipWarpSize. For now, you can define it as a macro: 64 for HIP and 32 for CUDA. Shuffle instructions are implemented in the compiler.
Is there already some HIP Sum reduction kernel for floating points?
Hi, Not yet. Did you try changing hipWarpSize to 64?
__inline__ __device__
int warpReduceSum(int val) {
for (int offset = 64/2; offset > 0; offset /= 2)
val += __shfl_down(val, offset);
return val;
}
By "implemented in the compiler", do you mean '__shfl_down' should work with hipcc? At least with the version I'm using, hipcc complains about it.
I believe it only works with nvcc (the NVIDIA compiler).
Hi, Here are the APIs in header. https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/hcc_detail/hip_runtime.h#L405
When using: val += __shfl_down(val, offset, hipWarpSize);
When compiling with hipcc I get:

error: Failed to get ehdr: invalid `Elf' handle
clang-3.5: error: linker command failed with exit code 1 (use -v to see invocation)
Died at /opt/rocm/bin/hipcc line 230.
This is the complete code:
__inline__ __device__
TYPE warpReduceSum(TYPE val) {
  for (int offset = hipWarpSize/2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset, hipWarpSize);
  return val;
}
__inline__ __device__
TYPE blockReduceSum(TYPE val) {
  __shared__ TYPE shared[64]; // Shared mem for partial sums (one per warp)
  int lane = hipThreadIdx_x % hipWarpSize;
  int wid = hipThreadIdx_x / hipWarpSize;
  val = warpReduceSum(val); // Each warp performs partial reduction
  if (lane==0) shared[wid]=val; // Write reduced value to shared memory
  __syncthreads(); // Wait for all partial reductions
  // Read from shared memory only if that warp existed
  val = (hipThreadIdx_x < hipBlockDim_x / hipWarpSize) ? shared[lane] : 0;
  if (wid==0) val = warpReduceSum(val); // Final reduce within first warp
  return val;
}
__global__ void deviceReduceKernel(hipLaunchParm lp, TYPE *in, TYPE *out, int N) {
  TYPE sum = 0;
  // Reduce multiple elements per thread (grid-stride loop)
  for (int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x; i < N; i += hipBlockDim_x * hipGridDim_x) {
    sum += in[i];
  }
  sum = blockReduceSum(sum);
  if (hipThreadIdx_x==0) out[hipBlockIdx_x]=sum;
}
__shfl_down has only recently been implemented in the hcc compiler with the GCN ISA backend (you should see a link /opt/rocm/hcc --> /opt/rocm/hcc-lc). You can get the compiler update either by building it from source (you'll also have to rebuild HIP with the new compiler), or you can wait a few more days for a new binary update from apt-get.
As the conversation is inactive, closing issue.
Hi!
I'm using a CUDA code that calls a function from the NVIDIA CUB library.
This is the code:
'in_d' is the input array with the values to be reduced, 'sum_d' is an array with a single position to include the result of the complete reduction of the array, and 'N' is the number of elements in the array that are to be reduced.
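For context, a CUB sum reduction of that shape typically looks like the following (a sketch only, since the actual code from the post is not shown; cub::DeviceReduce::Sum is the usual entry point, called once to query the temporary-storage size and once to perform the reduction):

```cpp
// Sketch, not the poster's actual code. First call with a null temp
// pointer reports the required temporary storage in temp_storage_bytes;
// the second call does the reduction into sum_d.
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_d, sum_d, N);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_d, sum_d, N);
```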
Is there any equivalent HIP library function for Sum reduction that I can use on an AMD Fiji card (AMD R9 Nano)?