ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

Hipifying a Cuda file that has a call to a reduction function #19

Closed rjfnobre closed 7 years ago

rjfnobre commented 8 years ago

Hi!

I'm using CUDA code that calls a function from the NVIDIA CUB library.

This is the code:

    // Determine temporary device storage requirements
    void     *d_temp_storage = NULL;
    size_t   temp_storage_bytes = 0;

    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_d, sum_d, N);

    // Allocate temporary storage
    CHECK(cudaMalloc(&d_temp_storage, temp_storage_bytes));

    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_d, sum_d, N);

'in_d' is the input array with the values to be reduced, 'sum_d' is a single-element array that receives the result of the complete reduction, and 'N' is the number of elements in the array to be reduced.

Is there any equivalent HIP library function for Sum reduction that I can use on an AMD Fiji card (AMD R9 Nano)?

aditya4d1 commented 8 years ago

Hi, As of today, no. Currently, HIP supports only the programming APIs; there is no library equivalent of CUB yet.

rjfnobre commented 8 years ago

Thanks for the fast reply!

So how can I implement something like what is shown here? https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/

If I convert the kernel (using hipify):

    __inline__ __device__
    int warpReduceSum(int val) {
      for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
      return val;
    }

It converts to:

    __inline__ __device__
    int warpReduceSum(int val) {
      for (int offset = hipWarpSize/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
      return val;
    }

I get errors from hipcc saying that it does not recognize 'hipWarpSize'.

What is the correct translation between this piece of CUDA code and HIP code?

aditya4d1 commented 8 years ago

Hi, We are working on hipWarpSize. For now, you can define it as a macro with value 64 for HIP (AMD) and 32 for CUDA. Shuffle instructions are implemented in the compiler.
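
A minimal sketch of that macro workaround, assuming a platform guard such as '__HIP_PLATFORM_HCC__' (the exact macro name may differ between HIP versions):

    // Sketch only: pick the wavefront/warp size at compile time.
    // __HIP_PLATFORM_HCC__ is an assumed guard; check which platform
    // macro your HIP installation actually defines.
    #ifdef __HIP_PLATFORM_HCC__
    #define hipWarpSize 64   // AMD wavefront size
    #else
    #define hipWarpSize 32   // NVIDIA warp size
    #endif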

rjfnobre commented 8 years ago

Is there already some HIP Sum reduction kernel for floating points?

aditya4d1 commented 8 years ago

Hi, Not yet. Did you try changing hipWarpSize to 64?

    __inline__ __device__
    int warpReduceSum(int val) {
      for (int offset = 64/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
      return val;
    }

rjfnobre commented 8 years ago

By "implemented in the compiler" do you mean '__shfl_down' should work with hipcc? Because at least for the version I'm using hipcc complains about it.

I believe it only works with nvcc (the nvidia compiler).

aditya4d1 commented 8 years ago

Hi, Here are the APIs in header. https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/hcc_detail/hip_runtime.h#L405

rjfnobre commented 8 years ago

When using:

    val += __shfl_down(val, offset, hipWarpSize);

I get the following when compiling with hipcc:

    error: <unknown>:0:0: in function _ZN12_GLOBAL__N_154_Z18deviceReduceKernel16grid_launch_parmPfS0_i_functor19__cxxamp_trampolineEiiiiiiiiiiiijPKfPfi void (i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, float addrspace(1)*, float addrspace(1)*, i32): unsupported call to function __hsail_activelanepermute_b32

    Failed to get ehdr: invalid `Elf' handle
    clang-3.5: error: linker command failed with exit code 1 (use -v to see invocation)
    Died at /opt/rocm/bin/hipcc line 230.

rjfnobre commented 8 years ago

This is the complete code:

    #define hipWarpSize 64 // AMD: 64, NVIDIA: 32

    __inline__ __device__ TYPE warpReduceSum(TYPE val) {
      for (int offset = hipWarpSize/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset, hipWarpSize);
      return val;
    }

    __inline__ __device__ TYPE blockReduceSum(TYPE val) {
      __shared__ TYPE shared[64]; // Shared mem for partial sums
      int lane = hipThreadIdx_x % hipWarpSize;
      int wid  = hipThreadIdx_x / hipWarpSize;

      val = warpReduceSum(val); // Each warp performs partial reduction

      if (lane == 0) shared[wid] = val; // Write reduced value to shared memory

      __syncthreads(); // Wait for all partial reductions

      // Read from shared memory only if that warp existed
      val = (hipThreadIdx_x < hipBlockDim_x / hipWarpSize) ? shared[lane] : 0;

      if (wid == 0) val = warpReduceSum(val); // Final reduce within first warp

      return val;
    }

    __global__ void deviceReduceKernel(hipLaunchParm lp, const TYPE *in, TYPE *out, int N) {
      TYPE sum = 0;
      // Reduce multiple elements per thread
      for (int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
           i < N;
           i += hipBlockDim_x * hipGridDim_x) {
        sum += in[i];
      }
      sum = blockReduceSum(sum);
      if (hipThreadIdx_x == 0) out[hipBlockIdx_x] = sum;
    }
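
For reference, a hypothetical host-side launch of the kernel above, using the hipLaunchParm-era hipLaunchKernel macro; the block/grid sizes and the two-pass structure follow the NVIDIA blog post and are not from the original comment:

    // Hypothetical launch sketch. Assumes in_d holds N elements and
    // out_d has room for at least 'blocks' partial sums.
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;

    // First pass: each block writes one partial sum to out_d.
    hipLaunchKernel(deviceReduceKernel, dim3(blocks), dim3(threads), 0, 0,
                    in_d, out_d, N);
    // Second pass: a single block reduces the partial sums in place.
    hipLaunchKernel(deviceReduceKernel, dim3(1), dim3(1024), 0, 0,
                    out_d, out_d, blocks);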

scchan commented 8 years ago

__shfl_down has only recently been implemented in the hcc compiler with the GCN ISA backend (check whether /opt/rocm/hcc is a link to /opt/rocm/hcc-lc). You can get the compiler update either by building it from source (you'll also have to rebuild HIP with the new compiler), or you can wait a few more days for a new binary update from apt-get.

aditya4d1 commented 7 years ago

As the conversation is inactive, closing the issue.