NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Only the TMA thread arrive #3294

Closed zasdfgbnm closed 3 weeks ago

zasdfgbnm commented 4 weeks ago

Previously:

if (elect-sync) {
  arriveExpectTx
  TMA
} else {
  arrive
}

Now:

if (elect-sync) {
  arriveExpectTx
  TMA
}

I am very surprised that this recovers all of the latency introduced by the elect-sync fix https://github.com/NVIDIA/Fuser/pull/3295, and the result is even better! In general, though, we should sync as little as possible and avoid unnecessary waits, so I think this PR makes sense.
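
The difference between the two versions can be illustrated with a toy Python model of one mbarrier phase. This is only a sketch under stated assumptions, not nvFuser or CUDA code: the names `MBarrier` and `run`, the thread count 128, and the byte count 4096 are hypothetical, and the model assumes the barrier's expected arrival count drops to 1 when only the elected thread arrives, which the description above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class MBarrier:
    """Toy model of a single mbarrier phase (hypothetical, for illustration)."""
    expected_arrivals: int      # arrival count the barrier is initialized with
    pending_arrivals: int = 0   # arrivals still outstanding in this phase
    pending_tx_bytes: int = 0   # transaction bytes still outstanding
    arrive_calls: int = 0       # how many arrive operations were issued

    def __post_init__(self):
        self.pending_arrivals = self.expected_arrivals

    def arrive(self):
        self.pending_arrivals -= 1
        self.arrive_calls += 1

    def arrive_expect_tx(self, nbytes):
        # Register the bytes the TMA load will deliver, then arrive.
        self.pending_tx_bytes += nbytes
        self.arrive()

    def complete_tx(self, nbytes):
        # Stands in for the TMA unit signalling that the data has landed.
        self.pending_tx_bytes -= nbytes

    def phase_complete(self):
        return self.pending_arrivals == 0 and self.pending_tx_bytes == 0


def run(num_threads, nbytes, all_threads_arrive):
    # Assumption: the expected arrival count matches the number of arrivers.
    expected = num_threads if all_threads_arrive else 1
    bar = MBarrier(expected)
    for tid in range(num_threads):
        elected = (tid == 0)  # stand-in for elect-sync choosing one thread
        if elected:
            bar.arrive_expect_tx(nbytes)  # arriveExpectTx, then issue the TMA
        elif all_threads_arrive:
            bar.arrive()                  # the "else" branch removed by this PR
    bar.complete_tx(nbytes)               # TMA load finishes
    return bar


before = run(128, 4096, all_threads_arrive=True)
after = run(128, 4096, all_threads_arrive=False)
```

In this model both variants complete the phase, but the first issues 128 arrive operations (`before.arrive_calls`) and the second issues 1 (`after.arrive_calls`). The model says nothing about actual latency; it only shows that the extra arrivals are not needed for the barrier to complete.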

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     39.0           172735          1  172735.0  172735.0    172735    172735          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.0            88768          1   88768.0   88768.0     88768     88768          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 51.4% (88768 ns / 172735 ns, from the kernel times above).

zasdfgbnm commented 4 weeks ago

!build