Open charlesbluca opened 6 months ago
Thanks - I'm interested in the overhead that Numbast currently runs into. It's also worth printing out the PTX and comparing the two kernels: you can set `NUMBA_DUMP_ASSEMBLY` to see it for Numba, and use the `--ptx` flag for nvcc. I think part of the overhead could come from the foreign function call in the Numba kernel (and hopefully it should be mitigated by LTO support, but I'm not sure that can close a 4x gap).
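A sketch of how the two PTX dumps could be produced — the filenames `bench_numba.py` and `kernel.cu` are hypothetical stand-ins for the benchmark scripts:

```shell
# Dump the assembly (including PTX) that Numba generates while compiling
NUMBA_DUMP_ASSEMBLY=1 python bench_numba.py > numba_asm.txt

# Have nvcc emit PTX instead of a binary; writes kernel.ptx
nvcc --ptx kernel.cu -o kernel.ptx
```

Diffing the two PTX files should make any extra call overhead on the Numba side visible.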
Nsight Compute could also show how much time the kernel spends on each instruction.
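A minimal Nsight Compute invocation for this kind of per-instruction breakdown might look like the following (again assuming a hypothetical `bench_numba.py`):

```shell
# Collect a full metric set per kernel launch; the report's Source page
# attributes time to individual SASS/PTX instructions in the GUI
ncu --set full -o ncu_report python bench_numba.py
```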
Additionally, I'd encourage you to open a PR adding the above benchmarking scripts to the repo; I think `numbast/benchmarks` is a good place for them.
Quick summary of some light exploration I've done profiling Numba + Numbast versus raw CUDA C++ kernels, motivated by #12; I put together a minimal version of one of the tests:
And my best approximation of the equivalent raw CUDA C++ kernel:
Compiled like so:
Then ran these scripts through `nvprof` and `nsys`. Some things @quasiben and I noticed looking at these profiles:

- `cuInit`
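For reference, the profiling runs described above can be reproduced with something like the following, where `bench_numba.py` is a hypothetical name for the benchmark script:

```shell
# Nsight Systems timeline trace (includes CUDA API calls like cuInit)
nsys profile -o numba_report python bench_numba.py

# Legacy nvprof equivalent (deprecated on recent GPUs in favor of Nsight tools)
nvprof python bench_numba.py
```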
Would like to do some more exploration here, and will probably take a look at numba-inspector and `cuda.compile_ptx_for_current_device` to do so.
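For anyone following along, `compile_ptx_for_current_device` lets you grab the PTX Numba would generate without launching anything. A minimal sketch (the `axpy` kernel here is a hypothetical stand-in for the benchmark kernel, and this requires a CUDA-capable GPU to be present):

```python
from numba import cuda, float32

def axpy(r, a, x, y):
    # Stand-in kernel body, not the actual benchmark kernel from this issue
    i = cuda.grid(1)
    if i < r.size:
        r[i] = a * x[i] + y[i]

# Compile for the GPU in the current context; returns (ptx_source, return_type)
ptx, resty = cuda.compile_ptx_for_current_device(
    axpy, (float32[::1], float32, float32[::1], float32[::1])
)
print(ptx)  # compare against the nvcc --ptx output for the C++ kernel
```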