Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)`

NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Other

271 stars 53 forks source link

Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)` #3279

Open zasdfgbnm opened 3 weeks ago

zasdfgbnm commented 3 weeks ago

The CTA tile size (128, 256) is a size that can relatively easily achieve high math throughput. The problem size is carefully selected as one full wave. I believe this is a good incremental task.

Benchmark command:

nsys nvprof ./bin/test_matmul --gtest_filter=HopperMatmulTest.HSH_NT_128BSwizzle

Current perf on H200 on main as in the latest comment:

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     36.0           151775          1  151775.0  151775.0    151775    151775          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.7            87135          1   87135.0   87135.0     87135     87135          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 57.4%.

zasdfgbnm commented 3 weeks ago

Initial perf as measured in https://github.com/NVIDIA/Fuser/pull/3281:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 42.7%

zasdfgbnm commented 3 weeks ago

There is a perf regression after the fix of elect-sync: https://github.com/NVIDIA/Fuser/pull/3295

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)
 Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     47.8           247326          1  247326.0  247326.0    247326    247326          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     17.0            88191          1   88191.0   88191.0     88191     88191          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 35.6%

zasdfgbnm commented 3 weeks ago

After https://github.com/NVIDIA/Fuser/pull/3294:

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)
 Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     39.0           172735          1  172735.0  172735.0    172735    172735          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.0            88768          1   88768.0   88768.0     88768     88768          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 51.4%.

zasdfgbnm commented 3 weeks ago

After https://github.com/NVIDIA/Fuser/pull/3314

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     36.0           151775          1  151775.0  151775.0    151775    151775          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.7            87135          1   87135.0   87135.0     87135     87135          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 57.4%.