
Lecture 4 | CUDA Programming #184

Open chufanchen opened 3 months ago

chufanchen commented 3 months ago

Warp shuffles are a faster mechanism for moving data between threads in the same warp. There are 4 variants:

```cuda
T __shfl_up_sync(unsigned mask, T var, unsigned int delta);    // copy from the lane delta below the caller (lane ID - delta)
T __shfl_down_sync(unsigned mask, T var, unsigned int delta);  // copy from the lane delta above the caller (lane ID + delta)
T __shfl_xor_sync(unsigned mask, T var, int laneMask);         // copy from the lane at (lane ID XOR laneMask)
T __shfl_sync(unsigned mask, T var, int srcLane);              // copy from the lane given by srcLane
```

Here the lane ID is a thread's position within its warp.
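
As an example of why these are useful, a warp-level reduction can be written entirely with `__shfl_down_sync`, with no shared memory traffic. A minimal sketch (the helper name `warpReduceSum` is illustrative, not from the lecture):

```cuda
// Each of the 32 lanes contributes one value; after log2(32) = 5 halving
// steps, lane 0 holds the sum of the whole warp. The mask 0xffffffff
// means all 32 lanes participate.
__inline__ __device__ float warpReduceSum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;  // the full sum is only guaranteed in lane 0
}
```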
chufanchen commented 3 months ago

Key requirements for a reduction operator $\circ$ are:

  * commutativity: $a \circ b = b \circ a$
  * associativity: $(a \circ b) \circ c = a \circ (b \circ c)$

Together, they mean that the elements can be re-arranged and combined in any order.

Assuming each thread starts with one value, the approach is to:

  1. first add the values within each thread block, to form a partial sum (local reduction);
  2. then add together the partial sums from all of the blocks (global reduction); see the host-side sketch after this list.
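
A minimal host-side sketch of the two phases, assuming $N$ = `blocks * threads` with both powers of two, and using the `sum` kernel defined at the end of this comment (`d_partial` and `d_total` are illustrative names):

```cuda
// Phase 1: one partial sum per block; the third launch parameter sizes
// the kernel's extern __shared__ array to one float per thread.
sum<<<blocks, threads, threads*sizeof(float)>>>(d_partial, d_data);
// Phase 2: a single block reduces the per-block partial sums to
// d_total[0] (requires blocks <= 1024, the maximum threads per block).
sum<<<1, blocks, blocks*sizeof(float)>>>(d_total, d_partial);
```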

The first phase does a parallel summation of $N$ values:

  1. first sum them in pairs to get $N/2$ values
  2. repeat the procedure until we have only one value

This takes $\log_2 N$ steps, e.g. for $N=8$: $8 \to 4 \to 2 \to 1$.

Are there any problems with warp divergence? Note that not all threads can be busy all of the time: the first step uses $N/2$ threads, the next $N/4$, and so on, so later steps leave most threads idle.

For efficiency, we want to make sure that each warp is either fully active or fully inactive, as far as possible; the two addressing patterns sketched below illustrate the difference.
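
A hedged comparison of two loop structures for the tree reduction (using the same `temp` and `tid` as the kernel below; only the indexing differs):

```cuda
// Interleaved addressing, d doubling (1, 2, 4, ...): the active threads
// (0, 2, 4, ... then 0, 4, 8, ...) stay scattered across all warps, so
// every warp diverges at every step.
for (int d = 1; d < blockDim.x; d *= 2) {
  __syncthreads();
  if (tid % (2*d) == 0) temp[tid] += temp[tid + d];
}

// Sequential addressing, d halving (the pattern used below): the active
// threads are packed into 0..d-1, so once d drops below the warp size
// all but one warp are fully inactive and simply skipped.
for (int d = blockDim.x/2; d > 0; d /= 2) {
  __syncthreads();
  if (tid < d) temp[tid] += temp[tid + d];
}
```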

Where should the data be held? Threads need to access results produced by other threads, so the partial sums are kept in shared memory, which every thread in a block can read and write:

```cuda
__global__ void sum(float *d_sum, float *d_data) {
  extern __shared__ float temp[];          // dynamic shared memory, sized at launch
  int tid = threadIdx.x;
  temp[tid] = d_data[tid+blockIdx.x*blockDim.x];  // load one value per thread
  for (int d=blockDim.x/2; d>0; d=d/2) {   // pairwise tree reduction
    __syncthreads();                       // wait for all writes from the previous step
    if (tid<d) temp[tid] += temp[tid+d];
  }
  if (tid == 0) d_sum[blockIdx.x] = temp[0];  // thread 0 writes the block's partial sum
}
```
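
Connecting this back to the warp shuffles above: the shared-memory tree can be largely replaced by shuffle-based warp reductions. A hedged sketch, assuming `blockDim.x` is a multiple of 32 (at most 1024) and reusing the `warpReduceSum` helper sketched in the first comment (`sum_shfl` and `warpSums` are illustrative names):

```cuda
__global__ void sum_shfl(float *d_sum, float *d_data) {
  __shared__ float warpSums[32];               // one slot per warp (at most 1024/32)
  int tid = threadIdx.x;
  float v = d_data[tid + blockIdx.x*blockDim.x];
  v = warpReduceSum(v);                        // step 1: reduce within each warp
  if (tid % 32 == 0) warpSums[tid / 32] = v;   // lane 0 stores its warp's total
  __syncthreads();
  if (tid < 32) {                              // step 2: first warp reduces the warp totals
    v = (tid < blockDim.x / 32) ? warpSums[tid] : 0.0f;
    v = warpReduceSum(v);
    if (tid == 0) d_sum[blockIdx.x] = v;
  }
}
```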