Occupancy calculation using NVBit

NVlabs / NVBit


Occupancy calculation using NVBit #2

Closed CoffeeBeforeArch closed 4 years ago

CoffeeBeforeArch commented 4 years ago

Is there any way to call cudaOccupancyMaxActiveBlocksPerMultiprocessor within an NVBit tool on kernels being launched? I believe I can get the block size and dynamic shmem size, but I also need to get the kernel function to pass to the call. Is it possible to get that?

ovilla commented 4 years ago

You can call any CUDA API function inside the callback:

void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char *name, void *params, CUresult *pStatus)

However, you have to be careful that the call itself does not trigger another callback. I typically handle that by setting a flag before calling the CUDA API.

See for instance the example tools/mem_trace/mem_trace.cu in which a skip_flag is used to avoid "infinite" recursion of the nvbit_at_cuda_event call.
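
As a rough illustration of the skip-flag idea, here is a minimal sketch in plain C++ with no NVBit or CUDA dependency. The names api_call, on_api_event, handled, and callbacks_handled are hypothetical stand-ins and are not taken from mem_trace.cu; only the flag pattern itself is the point.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins: in a real tool the callback would be
// nvbit_at_cuda_event and the nested call would be a CUDA API function.
static bool skip_flag = false;            // set while the tool itself calls the API
static int callbacks_handled = 0;         // number of events the tool actually processed
static std::vector<std::string> handled;  // names of the processed events

void on_api_event(const std::string &name);

// Every API call notifies the callback, including calls made by the tool.
void api_call(const std::string &name) {
    on_api_event(name);
}

void on_api_event(const std::string &name) {
    // Ignore events caused by the tool's own API calls.
    if (skip_flag) return;

    ++callbacks_handled;
    handled.push_back(name);

    // Raise the flag around the nested call so it is not processed again,
    // which would otherwise recurse without end.
    skip_flag = true;
    api_call("query_issued_by_tool");
    skip_flag = false;
}
```

With this guard, one application-level api_call("launch") is handled exactly once, and the nested query returns immediately from the callback.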

The kernel function (CUfunction) is already part of the parameters passed to the kernel launch callback.

cuLaunchKernel_params *p = (cuLaunchKernel_params *)params;
p->f  // <-- this is the kernel function

CoffeeBeforeArch commented 4 years ago

Thanks for the reply! I have the looping handled, and I've used other CUDA API calls in tools. I had already tried passing the CUfunction to the occupancy API, but my initial impression is that it expects a different type.

For a call like:

CUDA_SAFECALL(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, p->f, b_size, d_shmem));

I get the following error:

Cuda error in function 'cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, p->f, b_size, d_shmem)' file 'occupancy_calc.cu' in line 195 : invalid device function.

Any thoughts?

ovilla commented 4 years ago

Yes, you should be calling cuOccupancyMaxActiveBlocksPerMultiprocessor instead of cudaOccupancyMaxActiveBlocksPerMultiprocessor.

The cu* functions belong to the driver-level API and take CUfunction handles. The cuda* functions belong to the runtime-level API and take a higher-level representation of the kernel.

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__OCCUPANCY.html#group__CUDA__OCCUPANCY_1gcc6e1094d05cba2cee17fe33ddd04a98
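
Once the driver call returns the number of active blocks per multiprocessor, a tool may want to turn it into an occupancy fraction. The sketch below is one way to do that arithmetic, not something taken from NVBit. The function name occupancy_fraction is hypothetical, and it assumes the warp size and the maximum resident threads per multiprocessor were obtained separately, for example through cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_WARP_SIZE and CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR.

```cpp
// Hypothetical helper: converts the block count reported by
// cuOccupancyMaxActiveBlocksPerMultiprocessor into a fraction in [0, 1].
// All inputs are plain integers, so nothing here depends on CUDA headers.
double occupancy_fraction(int active_blocks_per_sm, int block_size,
                          int warp_size, int max_threads_per_sm) {
    if (active_blocks_per_sm < 0 || block_size <= 0 || warp_size <= 0 ||
        max_threads_per_sm <= 0) {
        return 0.0;  // treat invalid inputs as zero occupancy
    }

    // A partially filled warp still occupies a full warp slot, so round up.
    int warps_per_block = (block_size + warp_size - 1) / warp_size;
    int active_warps = active_blocks_per_sm * warps_per_block;

    int max_warps = max_threads_per_sm / warp_size;
    if (max_warps == 0) return 0.0;

    return static_cast<double>(active_warps) / max_warps;
}
```

For instance, 4 active blocks of 256 threads on a device with 32-thread warps and 2048 resident threads per multiprocessor give 32 of 64 warp slots in use.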

CoffeeBeforeArch commented 4 years ago

Ah, perfect! That makes sense. I must have missed the driver-level version. Everything works now. I'll go ahead and close the issue. Thank you for the help!