```
Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------
    47.8           247326          1  247326.0   247326.0     247326    247326          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
    17.0            88191          1   88191.0    88191.0      88191     88191          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```
Perf nvFuser/cuBLAS: 35.6% (cuBLAS kernel: 88191 ns vs. nvFuser kernel: 247326 ns)
Strangely, elect-sync hurt perf instead of helping it. I need to look into this, but in any case, this PR is a bug fix, not a perf improvement. If elect-sync does not work, we should disable it, instead of enabling it and relying on a bug to avoid it hurting perf.
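For context, elect-sync here refers to the Hopper (sm_90+) `elect.sync` PTX instruction, which deterministically elects one leader lane in a warp, typically so that a single thread issues a TMA copy or mbarrier arrive. A minimal sketch of how it is commonly wrapped; the `electLeader` helper name is hypothetical, not nvFuser's actual API:

```CUDA
// Hedged sketch (sm_90+ only). electLeader is a hypothetical helper name,
// not the codegen's real function.
__device__ __forceinline__ bool electLeader() {
  unsigned is_leader = 0;
  asm volatile(
      "{\n\t"
      ".reg .pred p;\n\t"
      "elect.sync _|p, 0xffffffff;\n\t"  // p is true only in the elected lane
      "selp.u32 %0, 1, 0, p;\n\t"
      "}"
      : "=r"(is_leader));
  return is_leader != 0;
}

__global__ void example() {
  // Typical use: only the elected lane issues the TMA load / mbarrier arrive,
  // instead of guarding with threadIdx.x == 0.
  if (electLeader()) {
    // issue cp.async.bulk here
  }
}
```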
This PR fixes https://github.com/NVIDIA/Fuser/issues/3199
Generated code
```CUDA
__global__ void nvfuser_none_f0_c0_r0_g0(
    Tensor<__half, 3, 3> T0,
    Tensor<__half, 3, 3> T1,
    const __grid_constant__ TensorMap var0,
    const __grid_constant__ TensorMap var1,
    Tensor<__half, 2, 2> T3) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i2;
  i2 = ceilDiv(T0.logical_size[0LL], 16);
  nvfuser_index_t i3;
  i3 = -3 + i2;
  const TensorMap* ptr4;
  ptr4 = &var0;
  nvfuser_index_t i5;
  i5 = 256 * ((nvfuser_index_t)blockIdx.x);
  __half* T5 = reinterpret_cast<__half*>(array + smem_offset + 16512);
  unsigned i6;
  i6 = toSmem(T5);
  const TensorMap* ptr7;
  ptr7 = &var1;
  nvfuser_index_t i8;
  i8 = 128 * ((nvfuser_index_t)blockIdx.y);
  __half* T4 = reinterpret_cast<__half*>(array + smem_offset + 128);
  unsigned i9;
  i9 = toSmem(T4);
  unsigned i10;
  i10 = i9 + (2048 * ((nvfuser_index_t)threadIdx.y));
  nvfuser_index_t i11;
  i11 = ((nvfuser_index_t)threadIdx.x) / 4;
  nvfuser_index_t i12;
  i12 = 2 * (((nvfuser_index_t)threadIdx.x) % 4);
  nvfuser_index_t i13;
  i13 = i11 / 8;
  nvfuser_index_t i14;
  i14 = i11 % 8;
  nvfuser_index_t i15;
  i15 = ((((i12 + ((16 * T1.logical_size[2LL]) * i13)) + (T1.logical_size[2LL] * i14)) + ((64 * T1.logical_size[2LL]) * ((nvfuser_index_t)threadIdx.y))) + i5) + ((128 * T1.logical_size[2LL]) * ((nvfuser_index_t)blockIdx.y));
  nvfuser_index_t i16;
  i16 = 8 * T1.logical_size[2LL];
  bool b17;
  b17 = ((((nvfuser_index_t)threadIdx.x) < 32ULL) && (((nvfuser_index_t)threadIdx.y) == 0ULL)) && (((nvfuser_index_t)threadIdx.z) == 0ULL);
  nvfuser_index_t i18;
  i18 = ((1 - T1.logical_size[2LL]) + i12) + i5;
  nvfuser_index_t i19;
  i19 = ((((-T0.logical_size[1LL]) + (16 * i13)) + i14) + (64 * ((nvfuser_index_t)threadIdx.y))) + i8;
  float T2[128];
  ((*reinterpret_cast
```