Support bypassing data layout conversion for atomic operators

This update reduces LDS usage, which removes the large tile size limitation for stream-k when using atomic_add.
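To give a rough sense of why a layout conversion limits tile size, the sketch below compares the size of an f16 tile with an LDS budget. It is a back-of-the-envelope illustration, not the compiler's actual scratch-size calculation: the 64 KiB budget is an assumed figure, `tile_bytes` is a hypothetical helper, and the real scratch buffer used by convert_layout depends on the layouts involved and is usually smaller than the full tile. The point is only that the cost grows with tile size.

```python
# Rough illustration only: compares full-tile size with an assumed LDS budget.
# LDS_LIMIT_BYTES is an assumption, not a value taken from this change.
LDS_LIMIT_BYTES = 64 * 1024


def tile_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    """Bytes needed to hold a rows x cols tile (hypothetical helper)."""
    return rows * cols * elem_bytes


if __name__ == "__main__":
    for tile in (64, 128, 256):
        size = tile_bytes(tile, tile, 2)  # f16 elements are 2 bytes each
        status = "fits in" if size <= LDS_LIMIT_BYTES else "exceeds"
        print(f"{tile}x{tile} f16: {size // 1024} KiB, {status} the assumed budget")
```

Under these assumptions, a 256x256 f16 tile (the shape in the IR below) is twice the assumed budget, so skipping the conversion avoids that pressure for the atomic path.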

The bypass is supported for both StoreOp and AtomicRMWOp.

For example, it removes the layout conversion %84 from the IR below, so the atomic op consumes the #mfma tensor directly.

Before:

    %83 = arith.truncf %71 : tensor<256x256xf32, #mfma> to tensor<256x256xf16, #mfma>
    %84 = triton_gpu.convert_layout %83 : (tensor<256x256xf16, #mfma>) -> tensor<256x256xf16, #blocked>
    %85 = "tt.atomic_rmw"(%82, %84, %cst_0) <{atomic_rmw_op = 5 : i32, scope = 1 : i32, sem = 4 : i32}> : (tensor<256x256x!tt.ptr<f16, 1>, #blocked>, tensor<256x256xf16, #blocked>, tensor<256x256xi1, #blocked>) -> tensor<256x256xf16, #blocked>

After:

    %87 = arith.truncf %75 : tensor<256x256xf32, #mfma> to tensor<256x256xf16, #mfma>
    %88 = "tt.atomic_rmw"(%86, %87, %cst) <{atomic_rmw_op = 5 : i32, scope = 1 : i32, sem = 4 : i32}> : (tensor<256x256x!tt.ptr<f16, 1>, #mfma>, tensor<256x256xf16, #mfma>, tensor<256x256xi1, #mfma>) -> tensor<256x256xf16, #mfma>
  }

ROCm/triton #556