Key requirements for a reduction operator $\circ$ are:

- commutativity: $a \circ b = b \circ a$
- associativity: $a \circ (b \circ c) = (a \circ b) \circ c$

Together, they mean that the elements can be re-arranged and combined in any order.
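As a small host-side illustration (not from the original post; the choice of operator and values is arbitrary), a left-to-right fold and the pairwise tree order used by a parallel reduction give the same answer for any operator with these two properties:

#include <stdio.h>

// any commutative and associative operator works; integer max is one example
static int op(int a, int b) { return a > b ? a : b; }

int main(void) {
    int v[8] = {3, 7, 2, 9, 4, 1, 8, 5};

    // left-to-right fold: ((((v0 o v1) o v2) o ...) o v7)
    int seq = v[0];
    for (int i = 1; i < 8; i++) seq = op(seq, v[i]);

    // pairwise tree fold, the combination order a parallel reduction uses
    int t[8];
    for (int i = 0; i < 8; i++) t[i] = v[i];
    for (int d = 4; d > 0; d /= 2)
        for (int i = 0; i < d; i++) t[i] = op(t[i], t[i + d]);

    printf("sequential = %d, tree = %d\n", seq, t[0]);   // both print 9
    return 0;
}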
Assuming each thread starts with one value, the approach is to first add the values within each thread block to form a partial sum, and then add together the partial sums from all of the blocks.
The first phase does parallel summation of $N$ values: first sum them in pairs to get $N/2$ values, then repeat the procedure until there is only one value.
Are there any problems with warp divergence? Note that not all threads can be busy all of the time: $N/2$ threads are needed for the first pass, $N/4$ for the second, and so on down to a single thread for the last.
For efficiency, we want to make sure that each warp is either fully active or fully inactive, as far as possible.
Where should the data be held? Threads need to access results produced by other threads: global device arrays are visible to all threads but slow, while shared memory is fast but only shared within a block, so the kernel below keeps each block's partial sums in shared memory.
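To see the divergence problem concretely, here is a naive interleaved-addressing version of the block sum (a sketch, not from the original post) to contrast with the kernel below:

__global__ void sum_divergent(float *d_sum, float *d_data) {
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[tid] = d_data[tid + blockIdx.x*blockDim.x];

    // interleaved addressing: the active threads (tid % (2*d) == 0) are
    // scattered through every warp, so each warp remains partially active
    // at every step of the loop
    for (int d = 1; d < blockDim.x; d *= 2) {
        __syncthreads();
        if (tid % (2*d) == 0) temp[tid] += temp[tid + d];
    }
    if (tid == 0) d_sum[blockIdx.x] = temp[0];
}

The kernel below instead tests tid < d, which packs the active threads into the lowest-numbered lanes so that whole warps become inactive together.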
__global__ void sum(float *d_sum, float *d_data) {
    // one float per thread, allocated at launch time via the third
    // kernel launch argument
    extern __shared__ float temp[];
    int tid = threadIdx.x;

    // load one value per thread into shared memory
    temp[tid] = d_data[tid + blockIdx.x*blockDim.x];

    // tree reduction: halve the number of active threads each pass
    // (assumes blockDim.x is a power of 2)
    for (int d = blockDim.x/2; d > 0; d = d/2) {
        __syncthreads();   // wait for all writes from the previous pass
        if (tid < d) temp[tid] += temp[tid + d];
    }

    // thread 0 writes this block's partial sum
    if (tid == 0) d_sum[blockIdx.x] = temp[0];
}
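A minimal host-side driver for this kernel might look as follows (a sketch, not part of the original post; it assumes it is compiled in the same .cu file as the kernel above, that $N$ is a multiple of the block size, and it performs the second phase, adding the per-block partial sums, on the host):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20, threads = 256, blocks = N / threads;

    float *h_data = (float*) malloc(N * sizeof(float));
    float *h_sum  = (float*) malloc(blocks * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;   // known answer: N

    float *d_data, *d_sum;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMalloc(&d_sum, blocks * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // first phase on the device; the third launch argument sets the size
    // of the dynamic shared memory backing "extern __shared__ float temp[]"
    sum<<<blocks, threads, threads * sizeof(float)>>>(d_sum, d_data);

    cudaMemcpy(h_sum, d_sum, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // second phase on the host: add the per-block partial sums
    float total = 0.0f;
    for (int b = 0; b < blocks; b++) total += h_sum[b];
    printf("total = %f (expected %d)\n", total, N);

    cudaFree(d_data); cudaFree(d_sum);
    free(h_data); free(h_sum);
    return 0;
}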
Warp shuffles are a faster mechanism for moving data between threads in the same warp. There are 4 variants:
- __shfl_up_sync(mask, var, delta): copy from a lane with a lower ID relative to the caller
- __shfl_down_sync(mask, var, delta): copy from a lane with a higher ID relative to the caller
- __shfl_xor_sync(mask, var, laneMask): copy from a lane based on a bitwise XOR of the caller's own lane ID
- __shfl_sync(mask, var, srcLane): copy from an indexed lane ID

Here the lane ID is the position within the warp.

- mask controls which threads are involved; it is usually set to -1 or 0xffffffff, equivalent to all 1's.
- var is a local register variable (int, unsigned int, long long, unsigned long long, float or double).
- delta is the offset within the warp; if the appropriate thread does not exist (i.e. it is off the end of the warp), the value is taken from the current thread instead.
- laneMask is combined with the calling thread's laneID by a bitwise XOR to determine the lane from which to copy the value (laneMask controls which bits of laneID are flipped).
- srcLane is the ID of the lane from which to copy the value.