use cudaOccupancyMaxPotentialBlockSize in LaunchConfig::for_num_elems

ariasanovsky commented 1 year ago

for_num_elements defaults to block size 1024 but this is often suboptimal for performance. See NVIDIA's article on optimal number of blocks and threads.

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details

The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in /include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.

For example in C++ CUDA:

#include "stdio.h"

__global__ void MyKernel(int *array, int arrayCount) 
{ 
  int idx = threadIdx.x + blockIdx.x * blockDim.x; 
  if (idx < arrayCount) 
  { 
    array[idx] *= array[idx]; 
  } 
} 

void launchMyKernel(int *array, int arrayCount) 
{ 
  int blockSize;   // The launch configurator returned block size 
  int minGridSize; // The minimum grid size needed to achieve the 
                   // maximum occupancy for a full device launch 
  int gridSize;    // The actual grid size needed, based on input size 

  cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
                                      MyKernel, 0, 0); 
  // Round up according to array size 
  gridSize = (arrayCount + blockSize - 1) / blockSize; 

  MyKernel<<< gridSize, blockSize >>>(array, arrayCount); 

  cudaDeviceSynchronize();

coreylowman commented 1 year ago

Good idea! We can use this in dfdx, it will be much more optimal 😄

brandonros commented 5 months ago

Example in case anybody else lands on this:

extern "C" fn block_size_to_dynamic_smem_size(_block_size: std::ffi::c_int) -> usize {
    0
}

// get kernel
let my_kernel = device.get_func("my_module", "my_kernel").unwrap();

// get size
let (min_grid_size, min_block_size) = my_kernel.occupancy_max_potential_block_size(block_size_to_dynamic_smem_size, 0, 0, None)?;
log::info!("min_grid_size = {min_grid_size} min_block_size = {min_block_size}");

coreylowman / cudarc

use cudaOccupancyMaxPotentialBlockSize in LaunchConfig::for_num_elems #149