For example, we see the following trace:
$ nsys nvprof --print-gpu-trace build/nvfuser_bench --benchmark_filter=NvFuserScheduler_Matmul_Manual/nvfuser_splitk_TN/M:1024/N:2048/K:50304/warps:4/stages:3/splitk_factor:2/manual_time
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MB/s) SrcMemKd DstMemKd Device Ctx GreenCtx Strm Name
---------- ------------- ------ ----- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------------------- --- -------- ---- ----------------------------------------------------------------------------------------------------
...
5738218680 2048 31462 1 1 1 128 1 1 16 0.000 0.000 NVIDIA A100 80GB PCIe (0) 1 7 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<long>, at::detail::A…
5738245336 2461850 31475 8 16 2 32 2 2 240 0.000 0.049 NVIDIA A100 80GB PCIe (0) 1 7 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)2, (int)2>, <unnamed>…
...
The first kernel is at::zeros(). Including the latency between these two launches, this introduces 5738245336 - 5738218680 = 26656 ns of overhead, which is 1.1% of the runtime of the actual kernel (2.46 ms). This is not a particularly small problem (we commonly have kernels with runtimes of 100-200 us), and since the latency penalty is fixed, the impact can be larger for smaller problems. This can negate the main benefit of single-kernel split-K over two-kernel split-K: removing a separate kernel launch. It's worth noting that cuBLAS reuses zeroed workspace memory, so their single-kernel split-K traces contain only a single kernel:
$ nsys nvprof --print-gpu-trace build/nvfuser_bench --benchmark_filter=Baseline_Matmul/eagermode_.*_TN/M:1024/N:2048/K:50304/half_reduction:1/manual_time
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MB/s) SrcMemKd DstMemKd Device Ctx GreenCtx Strm Name
---------- ------------- ------ ----- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------------------- --- -------- ---- ----------------------------------------------------------------------------------------------------
...
1704585906 1069789 59125 8 8 3 256 1 1 238 0.049 0.098 NVIDIA A100 80GB PCIe (0) 1 7 ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_64x3_tn
...
(Disregard the timing of the kernel itself, since cuBLAS is using half-precision reduction in this case; see #1719.)
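For context, the extra kernel in the nvfuser trace corresponds to zero-initializing the grid-reduction semaphore/workspace buffer immediately before the fusion launch. A minimal sketch of that per-launch pattern, with hypothetical names and sizes rather than the actual executor code:

```cuda
#include <ATen/ATen.h>

// Hypothetical illustration of the current per-launch pattern: the at::zeros
// call below is what appears in the first trace as the
// vectorized_elementwise_kernel<..., FillFunctor<long>, ...> launch.
at::Tensor allocateSemaphores(int64_t num_semaphores, c10::DeviceIndex device) {
  // The fused split-K kernel is subsequently launched with this buffer as its
  // semaphore/workspace argument.
  return at::zeros(
      {num_semaphores},
      at::TensorOptions().dtype(at::kLong).device(at::Device(at::kCUDA, device)));
}
```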
This is complete for split-K matmul (serial grid reduction). However, we still launch a memset kernel before doing grid reductions. I have attempted to modify gridReduce to clean up the semaphore, but have not yet found a working approach that doesn't deadlock. Until then I'm leaving this open.
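To illustrate the idea (a simplified sketch with a hypothetical helper, not nvfuser's actual gridReduce): the last block to arrive resets the semaphore, so the buffer is left zeroed for the next launch without a separate memset.

```cuda
// Illustrative only; nvfuser's gridReduce handles semaphores differently, and
// as noted above the naive version of this cleanup can deadlock in the serial
// grid reduction path.
__device__ bool lastBlockAndCleanup(unsigned* semaphore, unsigned num_blocks) {
  __shared__ bool is_last;
  __threadfence();  // make this block's partial results visible before signaling
  if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
    unsigned arrived = atomicAdd(semaphore, 1u) + 1u;
    is_last = (arrived == num_blocks);
    if (is_last) {
      // Reset the counter so the same buffer can be reused on the next launch
      // without a separate memset/at::zeros kernel.
      atomicExch(semaphore, 0u);
    }
  }
  __syncthreads();
  return is_last;
}
```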
We currently need zeroed global memory buffers for cross-CTA communication. Our current executor calls at::zeros to initialize this buffer before each launch of our nvfuser kernel, adding a handful of microseconds. Instead, each executor, kernel runtime, or perhaps each process could hold one zeroed buffer (per device) and reuse it without the memset. This would require us to always clean up our semaphores after each use, as we already do for persistent kernels.
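A minimal sketch of what that reuse could look like on the host side, with hypothetical names (this is not existing executor code): a per-device cache that only pays for the fill kernel when the buffer is first created or needs to grow.

```cuda
#include <ATen/ATen.h>
#include <mutex>
#include <unordered_map>

// Hypothetical per-device cache of a zero-filled workspace. Reusing it is only
// correct if every kernel restores the buffer to zero before exiting, as the
// persistent-kernel path already does for its semaphores.
at::Tensor getZeroedWorkspace(int64_t num_bytes, c10::DeviceIndex device) {
  static std::mutex mtx;
  static std::unordered_map<c10::DeviceIndex, at::Tensor> cache;
  std::lock_guard<std::mutex> lock(mtx);
  at::Tensor& buf = cache[device];
  if (!buf.defined() || buf.numel() < num_bytes) {
    // Only this (re)allocation launches a fill kernel; subsequent launches
    // reuse the already-zeroed buffer.
    buf = at::zeros(
        {num_bytes},
        at::TensorOptions().dtype(at::kByte).device(at::Device(at::kCUDA, device)));
  }
  return buf;
}
```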