Open zasdfgbnm opened 3 weeks ago
Initial perf as measured in https://github.com/NVIDIA/Fuser/pull/3281:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
43.2 205150 1 205150.0 205150.0 205150 205150 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
18.5 87550 1 87550.0 87550.0 87550 87550 0.0 nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
nvFuser/cuBLAS = 42.7%
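The ratio quoted here appears to be the cuBLAS kernel time divided by the nvFuser kernel time (87550 / 205150), so higher is better and 100% would mean parity with cuBLAS. A minimal Python sketch of that arithmetic follows; the helper name `relative_perf` is hypothetical and not part of nvFuser or nsys.

```python
def relative_perf(nvfuser_ns: float, cublas_ns: float) -> float:
    """Return cuBLAS time as a percentage of nvFuser time (100 = parity).

    Hypothetical helper for illustration; assumes both inputs are
    "Total Time (ns)" values for a single kernel instance.
    """
    if nvfuser_ns <= 0:
        raise ValueError("nvfuser_ns must be positive")
    return 100.0 * cublas_ns / nvfuser_ns


# Total Time (ns) values copied from the nsys table above.
initial = relative_perf(nvfuser_ns=205150, cublas_ns=87550)
print(f"nvFuser/cuBLAS = {initial:.1f}%")
```

The same function can be applied to each later measurement to track progress toward parity.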
Perf regressed after the elect-sync fix in https://github.com/NVIDIA/Fuser/pull/3295:
Perf:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
47.8 247326 1 247326.0 247326.0 247326 247326 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
17.0 88191 1 88191.0 88191.0 88191 88191 0.0 nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
nvFuser/cuBLAS = 35.7%
After https://github.com/NVIDIA/Fuser/pull/3294:
Perf:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
39.0 172735 1 172735.0 172735.0 172735 172735 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
20.0 88768 1 88768.0 88768.0 88768 88768 0.0 nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
nvFuser/cuBLAS = 51.4%
After https://github.com/NVIDIA/Fuser/pull/3314:
Perf:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
36.0 151775 1 151775.0 151775.0 151775 151775 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
20.7 87135 1 87135.0 87135.0 87135 87135 0.0 nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
nvFuser/cuBLAS = 57.4%
The CTA tile size (128, 256) makes high math throughput relatively easy to achieve, and the problem size was chosen to fill exactly one full wave, so I believe this is a good incremental task.
Benchmark command:
Current perf on H200 on main, as of the latest comment:
nvFuser/cuBLAS = 57.4%
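To illustrate the "one full wave" remark in the description: a problem fills one wave when the number of CTA tiles equals the number of CTAs the GPU can run concurrently. The sketch below is only an assumption-laden illustration. It assumes one resident CTA per SM and 132 SMs (the published SM count for H100/H200 SXM parts), and the problem size in the example is hypothetical, since the issue does not state the actual M and N.

```python
def wave_count(m: int, n: int, tile_m: int = 128, tile_n: int = 256,
               num_sms: int = 132, ctas_per_sm: int = 1) -> float:
    """Estimate how many waves an M x N output needs for a given CTA tile.

    Assumptions (not taken from the issue): one CTA per SM and 132 SMs.
    """
    # Ceiling division: partial tiles still occupy a whole CTA.
    tiles_m = -(-m // tile_m)
    tiles_n = -(-n // tile_n)
    ctas = tiles_m * tiles_n
    return ctas / (num_sms * ctas_per_sm)


# Hypothetical problem size: 11 x 12 = 132 tiles of (128, 256).
waves = wave_count(m=11 * 128, n=12 * 256)
print(f"waves = {waves:.2f}")
```

A value just above 1.0 would mean a mostly idle second wave, which is why picking a size that lands exactly on one wave makes the comparison against cuBLAS cleaner.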