NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Reusable zeroed memory #1829

Open jacobhinkle opened 9 months ago

jacobhinkle commented 9 months ago

We currently need zeroed global memory buffers for cross-CTA communication. Our current executor calls at::zeros to initialize such a buffer before each launch of our nvFuser kernel, adding a handful of microseconds per launch. Instead, each executor, kernel runtime, or perhaps each process could hold one zeroed buffer (per device) and reuse it without the memset. This would require us to always clean up our semaphores after each use, as we already do for persistent kernels. A sketch of the idea is below.
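
For illustration, here is a minimal sketch of what such a cache could look like. The helper getZeroedWorkspace and its signature are hypothetical, not nvFuser's actual API; the point is just that at::zeros is paid once per device (or when the buffer must grow) rather than once per launch, relying on kernels to return the buffer to all zeros before they exit.

// Minimal sketch of a per-device zeroed-workspace cache. getZeroedWorkspace
// is a hypothetical helper, not nvFuser's actual API. The buffer is zeroed
// once (or when it must grow); kernels must reset their semaphores to zero
// before exiting so the buffer can be handed out again without a memset.
#include <ATen/ATen.h>
#include <mutex>
#include <unordered_map>

at::Tensor getZeroedWorkspace(int64_t device, int64_t num_bytes) {
  static std::mutex mtx;
  static std::unordered_map<int64_t, at::Tensor> cache; // one buffer per device
  std::lock_guard<std::mutex> lock(mtx);
  at::Tensor& buf = cache[device];
  if (!buf.defined() || buf.numel() < num_bytes) {
    // Only here do we pay for the allocation and the fill kernel.
    buf = at::zeros(
        {num_bytes},
        at::TensorOptions().dtype(at::kByte).device(at::kCUDA, device));
  }
  return buf;
}

One subtlety this sketch glosses over: if two fusions running on different streams share the buffer, reuse would additionally need stream ordering or per-stream buffers.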

jacobhinkle commented 8 months ago

For example, we see the following trace:

$ nsys nvprof --print-gpu-trace build/nvfuser_bench --benchmark_filter=NvFuserScheduler_Matmul_Manual/nvfuser_splitk_TN/M:1024/N:2048/K:50304/warps:4/stages:3/splitk_factor:2/manual_time
 Start (ns)  Duration (ns)  CorrId  GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MB/s)  SrcMemKd  DstMemKd           Device            Ctx  GreenCtx  Strm                                                  Name
 ----------  -------------  ------  -----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------------------  ---  --------  ----  ----------------------------------------------------------------------------------------------------
...
 5738218680           2048   31462       1     1     1   128     1     1       16         0.000         0.000                                                     NVIDIA A100 80GB PCIe (0)    1               7  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<long>, at::detail::A…
 5738245336        2461850   31475       8    16     2    32     2     2      240         0.000         0.049                                                     NVIDIA A100 80GB PCIe (0)    1               7  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)2, (int)2>, <unnamed>…
...

The first kernel is the fill launched by at::zeros(). Including the latency between the two launches, this introduces 5738245336 - 5738218680 = 26656 ns of overhead, which is 1.1% of the runtime of the actual kernel (2.46 ms). This is not a particularly small problem, and since the latency penalty is fixed, the relative impact grows for smaller problems: for the 100-200 us kernels we commonly run, the same ~27 us overhead would amount to 13-27%. This can negate the main benefit of single-kernel split-K over two-kernel split-K: removing a separate kernel launch. It's worth noting that cuBLAS reuses zeroed workspace memory, so its single-kernel split-K trace contains only one kernel:

$ nsys nvprof --print-gpu-trace build/nvfuser_bench --benchmark_filter=Baseline_Matmul/eagermode_.*_TN/M:1024/N:2048/K:50304/half_reduction:1/manual_time
 Start (ns)  Duration (ns)  CorrId  GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MB/s)  SrcMemKd  DstMemKd           Device            Ctx  GreenCtx  Strm                                                  Name
 ----------  -------------  ------  -----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------------------  ---  --------  ----  ----------------------------------------------------------------------------------------------------
...
 1704585906        1069789   59125     8     8     3   256     1     1      238         0.049         0.098                                                     NVIDIA A100 80GB PCIe (0)    1               7  ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_64x3_tn  
...

(Disregard the timing of the kernel itself, since cuBLAS is using a half-precision reduction in this case; see #1719.)

jacobhinkle commented 1 month ago

This is complete for split-K matmul (serial grid reduction). However, we still launch a memset kernel before other grid reductions. I have attempted to modify gridReduce to clean up its semaphore but have not yet found a working approach that doesn't deadlock; a sketch of the pattern in question is below. Until then, I'm leaving this issue open.
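
For context, the cleanup being attempted looks roughly like the classic last-block pattern below. This is a minimal hypothetical sketch, not nvFuser's actual gridReduce; the final reset of the semaphore is the step whose ordering has so far led to deadlocks.

// Minimal sketch of a last-block grid reduction with semaphore cleanup.
// Hypothetical code, not nvFuser's gridReduce. The semaphore is assumed to
// be zero before the first launch; the goal is for the kernel itself to
// return it to zero so no memset is needed before the next launch.
__device__ unsigned int semaphore = 0;

__global__ void gridSum(float* partials, float* out) {
  // ... each block writes its partial sum to partials[blockIdx.x] ...
  __threadfence(); // make this block's partial visible to the last block

  __shared__ bool is_last_block;
  if (threadIdx.x == 0) {
    unsigned int arrived = atomicAdd(&semaphore, 1u);
    is_last_block = (arrived == gridDim.x - 1);
  }
  __syncthreads();

  if (is_last_block) {
    // ... the last block to arrive reduces all partials into *out ...
    if (threadIdx.x == 0) {
      // Cleanup: reset the semaphore so the next launch can reuse it without
      // a preceding memset. Ordering this reset correctly, so that no other
      // block can still observe or increment the semaphore, is the open
      // problem mentioned above.
      semaphore = 0;
      __threadfence();
    }
  }
}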