Open ariasanovsky opened 1 year ago
Good idea! We can use this in dfdx, it will be much more optimal 😄
Example in case anybody else lands on this:
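// This kernel uses no dynamic shared memory, so report 0 bytes for every block size.
extern "C" fn block_size_to_dynamic_smem_size(_block_size: std::ffi::c_int) -> usize {
    0
}
// get kernel
let my_kernel = device.get_func("my_module", "my_kernel").unwrap();
// get the suggested sizes: the minimum grid size needed for maximum occupancy,
// and the block size that achieves it (a suggestion, not a minimum)
let (min_grid_size, block_size) = my_kernel.occupancy_max_potential_block_size(block_size_to_dynamic_smem_size, 0, 0, None)?;
log::info!("min_grid_size = {min_grid_size} block_size = {block_size}");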
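For anyone wondering what to do with the result: a rough sketch of turning the second value returned by occupancy_max_potential_block_size into 1-D launch dimensions. `LaunchDims` and `launch_dims_for` are hypothetical names made up for this example (a plain stand-in, not the cudarc type), and the sketch assumes a 1-D kernel with no dynamic shared memory.

```rust
/// Hypothetical stand-in for a launch configuration (grid, block, shared memory).
/// This is an illustration only, not a type from cudarc.
#[derive(Debug, PartialEq)]
struct LaunchDims {
    grid_dim: (u32, u32, u32),
    block_dim: (u32, u32, u32),
    shared_mem_bytes: u32,
}

/// Build a 1-D launch that covers `n` elements using the suggested block size.
fn launch_dims_for(n: u32, block_size: u32) -> LaunchDims {
    assert!(block_size > 0, "block size must be nonzero");
    // Ceiling division without overflow: enough blocks that grid * block >= n.
    let num_blocks = n / block_size + u32::from(n % block_size != 0);
    LaunchDims {
        grid_dim: (num_blocks, 1, 1),
        block_dim: (block_size, 1, 1),
        // Assumption: the kernel needs no dynamic shared memory.
        shared_mem_bytes: 0,
    }
}
```

The kernel still needs its usual bounds check (`if i < n`), since the last block may be only partly full.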
for_num_elements defaults to a block size of 1024, but this is often suboptimal for performance. See NVIDIA's article on choosing the optimal number of blocks and threads.
For example, in C++ CUDA: