Key requirements for a reduction operator $\circ$ are:

- commutativity: $a \circ b = b \circ a$
- associativity: $a \circ (b \circ c) = (a \circ b) \circ c$

Together, they mean that the elements can be re-arranged and combined in any order.
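As a small host-side illustration (not from the original post; the choice of operator and values is arbitrary), a left-to-right fold and the pairwise tree order used by a parallel reduction give the same answer for any operator with these two properties:

#include <stdio.h>

// any commutative and associative operator works; integer max is one example
static int op(int a, int b) { return a > b ? a : b; }

int main(void) {
    int v[8] = {3, 7, 2, 9, 4, 1, 8, 5};

    // left-to-right fold: ((((v0 o v1) o v2) o ...) o v7)
    int seq = v[0];
    for (int i = 1; i < 8; i++) seq = op(seq, v[i]);

    // pairwise tree fold, the combination order a parallel reduction uses
    int t[8];
    for (int i = 0; i < 8; i++) t[i] = v[i];
    for (int d = 4; d > 0; d /= 2)
        for (int i = 0; i < d; i++) t[i] = op(t[i], t[i + d]);

    printf("sequential = %d, tree = %d\n", seq, t[0]);   // both print 9
    return 0;
}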
Assuming each thread starts with one value, the approach is to first add the values within each thread block to form a partial sum, and then add together the partial sums from all of the blocks.
The first phase does parallel summation of $N$ values: first sum them in pairs to get $N/2$ values, then repeat the procedure until there is only one value.
Are there any problems with warp divergence? Note that not all threads can be busy all of the time: $N/2$ threads are needed for the first pass, $N/4$ for the second, and so on down to a single thread for the last.
For efficiency, we want to make sure that each warp is either fully active or fully inactive, as far as possible.
Where should the data be held? Threads need to access results produced by other threads: global device arrays are visible to all threads but slow, while shared memory is fast but only shared within a block, so the kernel below keeps each block's partial sums in shared memory.
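To see the divergence problem concretely, here is a naive interleaved-addressing version of the block sum (a sketch, not from the original post) to contrast with the kernel below:

__global__ void sum_divergent(float *d_sum, float *d_data) {
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[tid] = d_data[tid + blockIdx.x*blockDim.x];

    // interleaved addressing: the active threads (tid % (2*d) == 0) are
    // scattered through every warp, so each warp remains partially active
    // at every step of the loop
    for (int d = 1; d < blockDim.x; d *= 2) {
        __syncthreads();
        if (tid % (2*d) == 0) temp[tid] += temp[tid + d];
    }
    if (tid == 0) d_sum[blockIdx.x] = temp[0];
}

The kernel below instead tests tid < d, which packs the active threads into the lowest-numbered lanes so that whole warps become inactive together.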
__global__ void sum(float *d_sum, float *d_data) {
    // one float per thread, allocated at launch time via the third
    // kernel launch argument
    extern __shared__ float temp[];
    int tid = threadIdx.x;

    // load one value per thread into shared memory
    temp[tid] = d_data[tid + blockIdx.x*blockDim.x];

    // tree reduction: halve the number of active threads each pass
    // (assumes blockDim.x is a power of 2)
    for (int d = blockDim.x/2; d > 0; d = d/2) {
        __syncthreads();   // wait for all writes from the previous pass
        if (tid < d) temp[tid] += temp[tid + d];
    }

    // thread 0 writes this block's partial sum
    if (tid == 0) d_sum[blockIdx.x] = temp[0];
}
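A minimal host-side driver for this kernel might look as follows (a sketch, not part of the original post; it assumes it is compiled in the same .cu file as the kernel above, that $N$ is a multiple of the block size, and it performs the second phase, adding the per-block partial sums, on the host):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20, threads = 256, blocks = N / threads;

    float *h_data = (float*) malloc(N * sizeof(float));
    float *h_sum  = (float*) malloc(blocks * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;   // known answer: N

    float *d_data, *d_sum;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMalloc(&d_sum, blocks * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // first phase on the device; the third launch argument sets the size
    // of the dynamic shared memory backing "extern __shared__ float temp[]"
    sum<<<blocks, threads, threads * sizeof(float)>>>(d_sum, d_data);

    cudaMemcpy(h_sum, d_sum, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // second phase on the host: add the per-block partial sums
    float total = 0.0f;
    for (int b = 0; b < blocks; b++) total += h_sum[b];
    printf("total = %f (expected %d)\n", total, N);

    cudaFree(d_data); cudaFree(d_sum);
    free(h_data); free(h_sum);
    return 0;
}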
Warp shuffles are a faster mechanism for moving data between threads in the same warp. There are 4 variants:
- __shfl_up_sync(mask, var, delta): copy from a lane with a lower ID relative to the caller
- __shfl_down_sync(mask, var, delta): copy from a lane with a higher ID relative to the caller
- __shfl_xor_sync(mask, var, laneMask): copy from a lane based on a bitwise XOR of the caller's own lane ID
- __shfl_sync(mask, var, srcLane): copy from an indexed lane ID

Here the lane ID is the position within the warp.

- mask controls which threads are involved; it is usually set to -1 or 0xffffffff, equivalent to all 1's.
- var is a local register variable (int, unsigned int, long long, unsigned long long, float or double).
- delta is the offset within the warp; if the appropriate thread does not exist (i.e. it is off the end of the warp), the value is taken from the current thread instead.
- laneMask is combined with the calling thread's laneID by a bitwise XOR to determine the lane from which to copy the value (laneMask controls which bits of laneID are flipped).
- srcLane is the ID of the lane from which to copy the value.