Profiling numba+numbast versus CUDA C++ kernel calls

charlesbluca commented 6 months ago

Quick summary of some light exploration I've done profiling numba+numbast versus raw CUDA C++ kernels, as motivated by #12; put together a minimal version of one of the tests:

import numba.cuda as cuda
import numpy as np
from numba import float32
from fp16 import (
    half,
    get_shims,
)

@cuda.jit(link=get_shims())
def simple_kernel(arith):
    # Binary Arithmetic Operators
    a = half(1.0)
    b = half(2.0)
    arith[0] = float32(a + b)
    arith[1] = float32(a - b)
    arith[2] = float32(a * b)
    arith[3] = float32(a / b)

arith = np.zeros(4, dtype=np.float32)
simple_kernel[1, 1](arith)

And my best approximation of the equivalent raw CUDA C++ kernel:

#include <stdio.h>
#include <cuda_fp16.h>

__global__ void simple_kernel(float* arith) {
    // Binary Arithmetic Operators
    half a = __float2half(1.0f);
    half b = __float2half(2.0f);
    arith[0] = __half2float(__hadd(a, b));
    arith[1] = __half2float(__hsub(a, b));
    arith[2] = __half2float(__hmul(a, b));
    arith[3] = __half2float(__hdiv(a, b));
}

int main(void) {
    int N = 4;
    float *arith, *arith_d;
    arith = (float*)malloc(N*sizeof(float));

    cudaMalloc(&arith_d, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        arith[i] = 0.0f;
    }
    cudaMemcpy(arith_d, arith, N*sizeof(float), cudaMemcpyHostToDevice);

    simple_kernel<<<1, 1>>>(arith_d);
    cudaDeviceSynchronize();

    cudaMemcpy(arith, arith_d, N*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(arith_d);
    free(arith);

    return 0;
}

Compiled like so:

→ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
→ nvcc --gpu-architecture sm_70 arithmetic.cu -o arithmetic

Then ran these scripts through nvprof and nsys:

→ nvprof python arithmetic.py
==3479503== NVPROF is profiling process 3479503, command: python arithmetic.py
/raid/charlesb/miniforge3/envs/numbast-cuda124/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
/raid/charlesb/miniforge3/envs/numbast-cuda124/lib/python3.10/site-packages/numba/cuda/cudadrv/devicearray.py:886: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))
==3479503== Profiling application: python arithmetic.py
==3479503== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.96%  8.0630us         1  8.0630us  8.0630us  8.0630us  _ZN6cudapy8__main__13simple_kernelB2v1B94cw51cXTLSUwv1sCUt9Uw11Ew0NRRQPKzLTg4gaGKFsG2oMQGEYakJSQB1PQBk0Bynm21OiwU1a0UoLGhDpQE8oxrNQE_3dE5ArrayIfLi1E1C7mutable7alignedE
                   18.85%  2.3040us         1  2.3040us  2.3040us  2.3040us  [CUDA memcpy DtoH]
                   15.19%  1.8570us         1  1.8570us  1.8570us  1.8570us  [CUDA memcpy HtoD]
      API calls:   83.94%  294.54ms         1  294.54ms  294.54ms  294.54ms  cuDevicePrimaryCtxRetain
                   15.48%  54.299ms         1  54.299ms  54.299ms  54.299ms  cuModuleLoadDataEx
                    0.32%  1.1213ms         1  1.1213ms  1.1213ms  1.1213ms  cuMemAlloc
                    0.15%  512.11us         8  64.013us  58.245us  75.375us  cuDeviceGetName
                    0.03%  98.009us         1  98.009us  98.009us  98.009us  cuLaunchKernel
                    0.02%  52.866us        28  1.8880us     199ns  31.235us  cuCtxGetCurrent
                    0.01%  50.748us         1  50.748us  50.748us  50.748us  cuMemcpyDtoH
                    0.01%  49.557us         1  49.557us  49.557us  49.557us  cuMemcpyHtoD
                    0.01%  40.905us         1  40.905us  40.905us  40.905us  cuMemGetInfo
                    0.01%  27.448us         8  3.4310us  1.7720us  9.4650us  cuDeviceGetPCIBusId
                    0.01%  22.574us        27     836ns     143ns  9.0930us  cuCtxGetDevice
                    0.01%  19.048us        40     476ns     288ns  1.4420us  cuDeviceGetAttribute
                    0.00%  12.276us         1  12.276us  12.276us  12.276us  cuInit
                    0.00%  8.9870us         1  8.9870us  8.9870us  8.9870us  cuModuleGetFunction
                    0.00%  8.3660us        16     522ns     328ns  1.1890us  cuDeviceGet
                    0.00%  4.7950us         8     599ns     518ns     953ns  cuDeviceGetUuid
                    0.00%  2.4370us         5     487ns     195ns  1.1600us  cuFuncGetAttribute
                    0.00%  2.3550us         3     785ns     327ns  1.1790us  cuDeviceGetCount
                    0.00%  1.9390us         1  1.9390us  1.9390us  1.9390us  cuCtxPushCurrent

→ nvprof ./arithmetic
==3555075== NVPROF is profiling process 3555075, command: ./arithmetic
==3555075== Profiling application: ./arithmetic
==3555075== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.08%  2.4010us         1  2.4010us  2.4010us  2.4010us  [CUDA memcpy DtoH]
                   34.51%  2.1760us         1  2.1760us  2.1760us  2.1760us  simple_kernel(float*)
                   27.41%  1.7280us         1  1.7280us  1.7280us  1.7280us  [CUDA memcpy HtoD]
      API calls:   98.28%  258.71ms         1  258.71ms  258.71ms  258.71ms  cudaMalloc
                    1.32%  3.4827ms       912  3.8180us      92ns  262.76us  cuDeviceGetAttribute
                    0.19%  495.19us         1  495.19us  495.19us  495.19us  cudaFree
                    0.12%  304.00us         8  37.999us  35.296us  51.093us  cuDeviceGetName
                    0.05%  136.41us         1  136.41us  136.41us  136.41us  cuLibraryLoadData
                    0.02%  41.551us         2  20.775us  19.395us  22.156us  cudaMemcpy
                    0.01%  31.832us         1  31.832us  31.832us  31.832us  cudaLaunchKernel
                    0.01%  17.511us         8  2.1880us  1.0210us  9.3700us  cuDeviceGetPCIBusId
                    0.00%  5.0380us         1  5.0380us  5.0380us  5.0380us  cudaDeviceSynchronize
                    0.00%  2.4740us        16     154ns      97ns     543ns  cuDeviceGet
                    0.00%  2.1330us         8     266ns     167ns     520ns  cuDeviceTotalMem
                    0.00%  1.0950us         8     136ns     100ns     212ns  cuDeviceGetUuid
                    0.00%     907ns         3     302ns      98ns     637ns  cuDeviceGetCount
                    0.00%     256ns         1     256ns     256ns     256ns  cuModuleGetLoadingMode

Some things @quasiben and I noticed looking at these profiles:

The raw kernel didn’t make a call to cuInit
Beyond the initial overhead, the Numba kernel call itself takes roughly 4x as long as the raw kernel call

Would like to do some more exploration here and will probably take a look at numba-inspector and cuda.compile_ptx_for_current_device to do so.

isVoid commented 6 months ago

Thanks - I'm interested in the overhead that Numbast currently runs into. It's also worth printing out the PTXes and compare both kernels. You can do NUMBA_DUMP_ASSEMBLY to see for Numba. And --ptx flag for nvcc. I think part of the overhead could come from the foreign function call in Numba kernel (and hopefully should mitigate by LTO support, but not sure if it can save up to 4X).

Nsight compute could also explain how much time the kernel spend on each instruction.

isVoid commented 6 months ago

Additionally, I also encourage you to make a PR for the above benchmarking scripts into the repo. I think numbast/benchmarks is a good place for it.

NVIDIA / numbast

Profiling numba+numbast versus CUDA C++ kernel calls #46