if (elect-sync) {
arriveExpectTx
TMA
} else {
arrive
}
Now:
if (elect-sync) {
arriveExpectTx
TMA
}
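To make the difference concrete, here is a minimal toy model of the two schemes, not the actual generated kernel. It assumes the usual mbarrier rule that a phase completes once both the pending arrival count and the pending transaction bytes reach zero, and it assumes the barrier's expected arrival count drops from "all threads" to 1 with this change (the pseudocode above does not show the barrier init). `ToyMbarrier`, `run`, the thread count of 128 and the 4096-byte tile are all hypothetical names and values for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyMbarrier:
    """Toy model of one mbarrier phase: done when both counters reach zero."""

    pending_arrivals: int       # arrivals still expected in this phase
    pending_tx_bytes: int = 0   # bytes the async copy still has to deliver

    def arrive(self) -> None:
        self.pending_arrivals -= 1

    def arrive_expect_tx(self, nbytes: int) -> None:
        # Register the expected bytes first, then count the arrival.
        self.pending_tx_bytes += nbytes
        self.arrive()

    def complete_tx(self, nbytes: int) -> None:
        # Stand-in for the TMA engine reporting that the data has landed.
        self.pending_tx_bytes -= nbytes

    @property
    def phase_done(self) -> bool:
        return self.pending_arrivals == 0 and self.pending_tx_bytes == 0


def run(num_threads: int, tile_bytes: int, everyone_arrives: bool) -> int:
    """Return how many arrive-type operations the threads issued."""
    # Assumption: the expected arrival count matches the scheme in use.
    expected = num_threads if everyone_arrives else 1
    bar = ToyMbarrier(pending_arrivals=expected)
    ops = 0
    for tid in range(num_threads):
        if tid == 0:  # stand-in for the thread chosen by elect-sync
            bar.arrive_expect_tx(tile_bytes)
            ops += 1
        elif everyone_arrives:  # the old "else { arrive }" branch
            bar.arrive()
            ops += 1
    bar.complete_tx(tile_bytes)  # the TMA load finishes
    assert bar.phase_done
    return ops


old_ops = run(num_threads=128, tile_bytes=4096, everyone_arrives=True)
new_ops = run(num_threads=128, tile_bytes=4096, everyone_arrives=False)
print(old_ops, new_ops)
```

In this model the phase completes either way, but the old scheme issues one barrier operation per thread while the new one issues a single operation from the elected thread.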
I am very surprised that this removes all of the latency introduced by the elect-sync fix (https://github.com/NVIDIA/Fuser/pull/3295), and the result is even better! In general, though, we should sync as little as possible and avoid unnecessary waits, so I think this PR makes sense.
Perf:
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    39.0           172735          1  172735.0  172735.0    172735    172735          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
    20.0            88768          1   88768.0   88768.0     88768     88768          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
Perf nvFuser/cuBLAS: 51.4%.
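For reference, the 51.4% figure appears to be the cuBLAS kernel time divided by the nvFuser kernel time from the profile above. A quick check, assuming `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` is the cuBLAS kernel and `nvfuser_none_f0_c0_r0_g0` is the nvFuser one (the variable names below are just for illustration):

```python
# Average kernel times in nanoseconds, copied from the profile above.
nvfuser_ns = 172735
cublas_ns = 88768

# cuBLAS time as a fraction of nvFuser time (100% would mean parity).
ratio = cublas_ns / nvfuser_ns
summary = f"{ratio:.1%}"
print(summary)
```

Read this way, a higher percentage means the nvFuser kernel is closer to cuBLAS.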