Open xiaohuguo2023 opened 3 months ago
This update aims to reduce LDS usage and lift the large-tile-size limitation for stream-k when using atomic_add.
It does this by removing the layout conversion `%84`, so that `tt.atomic_rmw` operates directly on the `#mfma` layout instead of `#blocked`, changing
```mlir
%83 = arith.truncf %71 : tensor<256x256xf32, #mfma> to tensor<256x256xf16, #mfma>
%84 = triton_gpu.convert_layout %83 : (tensor<256x256xf16, #mfma>) -> tensor<256x256xf16, #blocked>
%85 = "tt.atomic_rmw"(%82, %84, %cst_0) <{atomic_rmw_op = 5 : i32, scope = 1 : i32, sem = 4 : i32}> : (tensor<256x256x!tt.ptr<f16, 1>, #blocked>, tensor<256x256xf16, #blocked>, tensor<256x256xi1, #blocked>) -> tensor<256x256xf16, #blocked>
```
to
```mlir
%87 = arith.truncf %75 : tensor<256x256xf32, #mfma> to tensor<256x256xf16, #mfma>
%88 = "tt.atomic_rmw"(%86, %87, %cst) <{atomic_rmw_op = 5 : i32, scope = 1 : i32, sem = 4 : i32}> : (tensor<256x256x!tt.ptr<f16, 1>, #mfma>, tensor<256x256xf16, #mfma>, tensor<256x256xi1, #mfma>) -> tensor<256x256xf16, #mfma>
```
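For context, a rough back-of-the-envelope sketch of why the removed `convert_layout` caps the tile size: lowering an `#mfma` → `#blocked` layout conversion stages the tile through LDS, so the whole output tile must fit in shared memory. The 64 KiB LDS figure below is an assumption for illustration (a common per-workgroup capacity on AMD GPUs); the tile shape matches the IR above.

```python
# Assumed per-workgroup LDS capacity (illustrative; varies by GPU).
LDS_BYTES = 64 * 1024

def tile_lds_bytes(m: int, n: int, elem_bytes: int) -> int:
    """Bytes needed to stage an m x n tile of elem_bytes-wide elements in LDS."""
    return m * n * elem_bytes

# The 256x256 fp16 tile from the IR above.
tile = tile_lds_bytes(256, 256, 2)
print(f"tile needs {tile} bytes, assumed LDS capacity is {LDS_BYTES}")
print("fits in LDS:", tile <= LDS_BYTES)
```

With these numbers the tile needs 128 KiB, twice the assumed LDS capacity, which is why the atomic path previously could not use such large tiles; keeping `tt.atomic_rmw` in `#mfma` avoids the LDS round-trip entirely.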