ROCm / triton

Development repository for the Triton language and compiler
MIT License

support bypassing data layout conversion for atomic operator #556

Open xiaohuguo2023 opened 3 months ago

xiaohuguo2023 commented 3 months ago

This update aims to reduce LDS usage, which removes the large tile size limitation for stream-k when using atomic_add.
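A rough back-of-the-envelope check shows why a layout conversion staged through LDS limits the tile size. This is only a sketch: the 64 KiB LDS budget and the idea that the whole tile is staged at once are assumptions made for illustration (the compiler may stage the conversion in smaller pieces), and the helper names are hypothetical.

```python
# Assumption: 64 KiB of LDS available to one workgroup.
LDS_BUDGET_BYTES = 64 * 1024


def tile_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    """Bytes needed to hold one full rows x cols tile."""
    return rows * cols * elem_bytes


def fits_in_lds(rows: int, cols: int, elem_bytes: int,
                budget: int = LDS_BUDGET_BYTES) -> bool:
    """True if a whole tile could be staged in LDS within the budget."""
    return tile_bytes(rows, cols, elem_bytes) <= budget


if __name__ == "__main__":
    # f16 elements are 2 bytes each.
    for rows, cols in [(128, 128), (256, 256)]:
        size_kib = tile_bytes(rows, cols, 2) // 1024
        print(f"{rows}x{cols} f16: {size_kib} KiB, fits: {fits_in_lds(rows, cols, 2)}")
```

Under these assumptions a 256x256 f16 tile would need 128 KiB, twice the assumed budget, so skipping the conversion avoids that staging buffer altogether.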

  1. add support for both StoreOp and AtomicRMWOp

This removes the layout conversion (%84 in the IR below), changing

 %83 = arith.truncf %71 : tensor<256x256xf32, #mfma> to tensor<256x256xf16, #mfma>
 %84 = triton_gpu.convert_layout %83 : (tensor<256x256xf16, #mfma>) -> tensor<256x256xf16, #blocked>
 %85 = "tt.atomic_rmw"(%82, %84, %cst_0) <{atomic_rmw_op = 5 : i32, scope = 1 : i32, sem = 4 : i32}> : (tensor<256x256x!tt.ptr<f16, 1>, #blocked>, tensor<256x256xf16, #blocked>, tensor<256x256xi1, #blocked>) -> tensor<256x256xf16, #blocked>

to

 %87 = arith.truncf %75 : tensor<256x256xf32, #mfma> to tensor<256x256xf16, #mfma>
 %88 = "tt.atomic_rmw"(%86, %87, %cst) <{atomic_rmw_op = 5 : i32, scope = 1 : i32, sem = 4 : i32}> : (tensor<256x256x!tt.ptr<f16, 1>, #mfma>, tensor<256x256xf16, #mfma>, tensor<256x256xi1, #mfma>) -> tensor<256x256xf16, #mfma>
        }
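The rewrite shown above can be modeled with a small toy pass. This is a minimal sketch and not the pass implemented in this PR: the `Op` record, `bypass_layout_conversion`, and the single-use check are all hypothetical, and it assumes the pointer and mask operands can be produced in the source layout as well.

```python
from dataclasses import dataclass
from typing import List

CONVERT = "triton_gpu.convert_layout"
# Ops that are allowed to consume a value directly in its source layout.
BYPASS_CONSUMERS = {"tt.store", "tt.atomic_rmw"}


@dataclass
class Op:
    name: str            # operation name, e.g. "tt.atomic_rmw"
    result: str          # SSA name of the result, e.g. "%85"
    operands: List[str]  # SSA names of the operands
    layout: str          # layout of the result, e.g. "#mfma"


def bypass_layout_conversion(ops: List[Op]) -> List[Op]:
    """Drop convert_layout ops whose only user is a store or atomic op."""
    by_result = {op.result: op for op in ops}
    use_count = {}
    for op in ops:
        for name in op.operands:
            use_count[name] = use_count.get(name, 0) + 1

    dead = set()
    rewritten = []
    for op in ops:
        if op.name in BYPASS_CONSUMERS:
            operands = list(op.operands)
            layout = op.layout
            for i, name in enumerate(operands):
                src = by_result.get(name)
                if src is not None and src.name == CONVERT and use_count[name] == 1:
                    # Use the value before conversion and adopt its layout.
                    operands[i] = src.operands[0]
                    producer = by_result.get(src.operands[0])
                    if producer is not None:
                        layout = producer.layout
                    dead.add(name)
            op = Op(op.name, op.result, operands, layout)
        rewritten.append(op)
    # Remove the conversions that no longer have any user.
    return [op for op in rewritten if op.result not in dead]


example = [
    Op("arith.truncf", "%83", ["%71"], "#mfma"),
    Op(CONVERT, "%84", ["%83"], "#blocked"),
    Op("tt.atomic_rmw", "%85", ["%82", "%84", "%cst_0"], "#blocked"),
]

if __name__ == "__main__":
    for op in bypass_layout_conversion(example):
        print(op)
```

The single-use check keeps the conversion in place when some other op still needs the converted value.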