
NVIDIA | An Even Easier Introduction to CUDA #150

Closed chufanchen closed 5 months ago

chufanchen commented 5 months ago

https://devblogs.nvidia.com/even-easier-introduction-cuda/

chufanchen commented 5 months ago
#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

I want to get this computation running (in parallel) on the many cores of a GPU.

  1. Turn the add function into a function that the GPU can run, called a kernel in CUDA.

To do this, we add the specifier __global__ to the function, which tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.

// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

These __global__ functions are known as kernels, and code that runs on the GPU is often called device code, while code that runs on the CPU is host code.

  2. Replace the calls to new in the code above with calls to cudaMallocManaged(), and replace the calls to delete [] with calls to cudaFree().
// Allocate Unified Memory -- accessible from CPU or GPU
float *x, *y;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
...
// Free memory
cudaFree(x);
cudaFree(y);

Side Note:

In CUDA 6, NVIDIA introduced Unified Memory. Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and the GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device, so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.
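
For contrast, here is a minimal sketch of the same allocations without Unified Memory, using explicit device buffers and copies. This is not from the post; d_x, d_y, h_x, and h_y are illustrative names, with the host arrays assumed to be initialized as in the CPU version.

// Explicit-copy alternative to cudaMallocManaged (sketch)
float *d_x, *d_y;
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));

// Copy the inputs to the device before launching the kernel
cudaMemcpy(d_x, h_x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, N*sizeof(float), cudaMemcpyHostToDevice);

// ... launch the kernel on d_x, d_y ...

// Copy the result back and release the device buffers
cudaMemcpy(h_y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);

With cudaMallocManaged, all of the cudaMemcpy calls above disappear; the driver migrates the pages on demand.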

  3. Launch the add() kernel on the GPU. CUDA kernel launches are specified using the triple angle bracket syntax <<< >>>.
add<<<1, 1>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
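
Putting the pieces together, the cuda_test.cu compiled below looks roughly like the following (assembled from the snippets above; treat it as a sketch):

#include <iostream>
#include <math.h>

// CUDA kernel to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  // Allocate Unified Memory -- accessible from CPU or GPU
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU (a single thread for now)
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
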
~$ nvcc cuda_test.cu -o add_cuda
~$ nvprof ./add_cuda
==3260685== NVPROF is profiling process 3260685, command: ./add_cuda
Max error: 0
==3260685== Profiling application: ./add_cuda
==3260685== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  174.22ms         1  174.22ms  174.22ms  174.22ms  add(int, float*, float*)
      API calls:   41.92%  176.25ms         2  88.126ms  30.539us  176.22ms  cudaMallocManaged
                   41.44%  174.22ms         1  174.22ms  174.22ms  174.22ms  cudaDeviceSynchronize
                   16.51%  69.411ms         1  69.411ms  69.411ms  69.411ms  cudaLaunchKernel
                    0.09%  378.67us         2  189.33us  175.03us  203.64us  cudaFree
                    0.04%  166.34us       114  1.4590us     141ns  63.183us  cuDeviceGetAttribute
                    0.00%  15.073us         1  15.073us  15.073us  15.073us  cuDeviceGetName
                    0.00%  7.6820us         1  7.6820us  7.6820us  7.6820us  cuDeviceGetPCIBusId
                    0.00%  1.4780us         3     492ns     191ns  1.0030us  cuDeviceGetCount
                    0.00%  1.0420us         2     521ns     184ns     858ns  cuDeviceGet
                    0.00%     506ns         1     506ns     506ns     506ns  cuDeviceTotalMem
                    0.00%     447ns         1     447ns     447ns     447ns  cuModuleGetLoadingMode
                    0.00%     277ns         1     277ns     277ns     277ns  cuDeviceGetUuid

==3260685== Unified Memory profiling result:
Device "NVIDIA TITAN X (Pascal) (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  759.4160us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  339.1450us  Device To Host
      12         -         -         -           -  1.972004ms  Gpu page fault groups
Total CPU Page faults: 36
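
Side Note: the Unified Memory rows above are on-demand page migrations. The 8 MB of Host To Device traffic is the two 4 MB arrays (x and y) faulting over to the GPU, and the 4 MB of Device To Host traffic is y coming back for the error check on the CPU. A hedged sketch of warming the data up front with cudaMemPrefetchAsync (device 0 and the default stream are assumptions, not part of the post):

// Prefetch the managed arrays to the GPU before the launch so the
// kernel does not page-fault on first touch (sketch; device 0 assumed)
int device = 0;
cudaMemPrefetchAsync(x, N*sizeof(float), device, 0);
cudaMemPrefetchAsync(y, N*sizeof(float), device, 0);
add<<<1, 1>>>(N, x, y);
cudaDeviceSynchronize();
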
  4. Parallelize the kernel using CUDA's execution configuration syntax, <<<numBlocks, blockSize>>>. For example:
add<<<1, 256>>>(N, x, y);

I need to modify the kernel to spread the computation across the parallel threads.

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

threadIdx.x contains the index of the current thread within its block, and blockDim.x contains the number of threads in the block. For example, with blockDim.x = 256, thread 0 handles elements 0, 256, 512, ..., while thread 255 handles elements 255, 511, 767, ...

==3271798== NVPROF is profiling process 3271798, command: ./add_cuda
Max error: 0
==3271798== Profiling application: ./add_cuda
==3271798== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.5162ms         1  2.5162ms  2.5162ms  2.5162ms  add(int, float*, float*)
      API calls:   89.35%  167.09ms         2  83.545ms  24.286us  167.07ms  cudaMallocManaged
                    8.97%  16.782ms         1  16.782ms  16.782ms  16.782ms  cudaLaunchKernel
                    1.34%  2.5153ms         1  2.5153ms  2.5153ms  2.5153ms  cudaDeviceSynchronize
                    0.21%  399.97us         2  199.99us  182.70us  217.27us  cudaFree
                    0.10%  191.04us       114  1.6750us     136ns  73.971us  cuDeviceGetAttribute
                    0.01%  17.896us         1  17.896us  17.896us  17.896us  cuDeviceGetName
                    0.00%  8.7150us         1  8.7150us  8.7150us  8.7150us  cuDeviceGetPCIBusId
                    0.00%  1.4980us         3     499ns     197ns     927ns  cuDeviceGetCount
                    0.00%  1.1800us         2     590ns     186ns     994ns  cuDeviceGet
                    0.00%     799ns         1     799ns     799ns     799ns  cuDeviceTotalMem
                    0.00%     460ns         1     460ns     460ns     460ns  cuModuleGetLoadingMode
                    0.00%     328ns         1     328ns     328ns     328ns  cuDeviceGetUuid

==3271798== Unified Memory profiling result:
Device "NVIDIA TITAN X (Pascal) (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  758.6180us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  340.6860us  Device To Host
      12         -         -         -           -  1.743612ms  Gpu page fault groups
Total CPU Page faults: 36

CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Titan X GPU based on the Pascal GPU Architecture has 28 SMs, each capable of supporting up to 2048 active threads.

The blocks of parallel threads make up what is known as the grid.

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
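
To make the indexing concrete under this configuration (blockSize = 256 and numBlocks = 4096 for N = 1<<20), here is a worked example:

// For block 2, thread 5:
//   index  = blockIdx.x * blockDim.x + threadIdx.x = 2 * 256 + 5 = 517
//   stride = blockDim.x * gridDim.x = 256 * 4096 = 1048576 = N
// With this grid every thread processes exactly one element, so the loop
// body runs once per thread; the stride loop keeps the kernel correct for
// any grid size.
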
==3275921== NVPROF is profiling process 3275921, command: ./add_cuda
Max error: 0
==3275921== Profiling application: ./add_cuda
==3275921== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.0485ms         1  2.0485ms  2.0485ms  2.0485ms  add(int, float*, float*)
      API calls:   89.46%  155.38ms         2  77.692ms  24.574us  155.36ms  cudaMallocManaged
                    9.01%  15.646ms         1  15.646ms  15.646ms  15.646ms  cudaLaunchKernel
                    1.18%  2.0493ms         1  2.0493ms  2.0493ms  2.0493ms  cudaDeviceSynchronize
                    0.23%  398.66us         2  199.33us  182.16us  216.50us  cudaFree
                    0.10%  181.72us       114  1.5940us     180ns  72.099us  cuDeviceGetAttribute
                    0.01%  16.007us         1  16.007us  16.007us  16.007us  cuDeviceGetName
                    0.01%  9.8340us         1  9.8340us  9.8340us  9.8340us  cuDeviceGetPCIBusId
                    0.00%  1.4360us         3     478ns     232ns     914ns  cuDeviceGetCount
                    0.00%  1.0640us         2     532ns     192ns     872ns  cuDeviceGet
                    0.00%     630ns         1     630ns     630ns     630ns  cuDeviceTotalMem
                    0.00%     458ns         1     458ns     458ns     458ns  cuModuleGetLoadingMode
                    0.00%     360ns         1     360ns     360ns     360ns  cuDeviceGetUuid

==3275921== Unified Memory profiling result:
Device "NVIDIA TITAN X (Pascal) (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      97  84.453KB  4.0000KB  0.9922MB  8.000000MB  803.9990us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  340.3340us  Device To Host
      16         -         -         -           -  2.437333ms  Gpu page fault groups
Total CPU Page faults: 36

This type of loop in a CUDA kernel is often called a grid-stride loop.
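
For contrast, a kernel written without a grid-stride loop assigns exactly one element per thread and needs a bounds check for the tail; this variant is a sketch (not from the post) and only covers all elements when numBlocks * blockSize >= n:

// Monolithic one-element-per-thread variant (sketch, for comparison)
__global__
void add_one_per_thread(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)            // guard threads past the end of the arrays
    y[i] = x[i] + y[i];
}

The grid-stride form above keeps working even if the kernel is launched with fewer total threads than elements.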

Version             Time        Bandwidth
1 CUDA Thread       174.22ms    -
1 CUDA Block        2.5162ms    -
Many CUDA Blocks    2.1255ms    -
Many GPUs           -           -
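
The bandwidth column is left unfilled above. A rough way to fill it in is to divide the bytes the kernel touches by the kernel time: the kernel reads x, reads y, and writes y, so it moves 3 * N * sizeof(float) bytes. A back-of-the-envelope sketch using the numbers from the table:

#include <cstdio>

int main(void)
{
  // Effective bandwidth estimate for the many-blocks run (sketch)
  double seconds   = 2.1255e-3;                        // kernel time from the table
  double bytes     = 3.0 * (1 << 20) * sizeof(float);  // read x, read y, write y: ~12.58 MB
  printf("Effective bandwidth: %.2f GB/s\n", bytes / seconds / 1e9);  // ~5.9 GB/s
  return 0;
}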

Summary

A block consists of many threads. In our case, blockDim.x == blockSize == 256 threads per block.

Similarly, a grid consists of many blocks. In our case, gridDim.x == numBlocks == (N + blockSize - 1) / blockSize = 4096.

The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
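
To check the SM figures quoted above for the GPU actually in use, the device properties can be queried at runtime (a sketch using the CUDA runtime API; device 0 is an assumption):

#include <cstdio>

int main(void)
{
  // Query device 0 and print its SM-related limits
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("%s: %d SMs, up to %d threads per SM, %d threads per block\n",
         prop.name, prop.multiProcessorCount,
         prop.maxThreadsPerMultiProcessor, prop.maxThreadsPerBlock);
  return 0;
}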