ROCm / triton

Development repository for the Triton language and compiler
MIT License

streamk atomics fix #632

Closed xiaohuguo2023 closed 3 months ago

xiaohuguo2023 commented 3 months ago

The streamk GEMM kernel uses a spin lock to implement a multiple-buffer method that replaces atomic_add.
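The multiple-buffer idea can be sketched on the CPU as a rough analogy. This is not the kernel's code: `Flag`, `reduce_with_flags`, and the use of Python threads are hypothetical stand-ins, and a mutex emulates the atomicity that the GPU provides in hardware. Each producer writes its partial result into its own buffer and then publishes a flag with an exchange; the consumer spins on a compare-and-swap until the flag is set, then reads the buffer.

```python
import threading
import time


class Flag:
    """Hypothetical stand-in for one lock word in global memory."""

    def __init__(self):
        self._value = 0
        self._mutex = threading.Lock()  # emulates hardware atomicity

    def atomic_xchg(self, new):
        # Store `new` unconditionally and return the previous value.
        with self._mutex:
            old = self._value
            self._value = new
            return old

    def atomic_cas(self, expected, new):
        # Store `new` only if the current value equals `expected`;
        # always return the value seen before the operation.
        with self._mutex:
            old = self._value
            if old == expected:
                self._value = new
            return old


def reduce_with_flags(partials):
    """Sum `partials` using one buffer and one flag per producer."""
    n = len(partials)
    buffers = [None] * n
    flags = [Flag() for _ in range(n)]

    def producer(i):
        buffers[i] = partials[i]  # 1. write the partial result
        flags[i].atomic_xchg(1)   # 2. then publish it

    threads = [threading.Thread(target=producer, args=(i,)) for i in range(1, n)]
    for t in threads:
        t.start()

    total = partials[0]  # the consumer's own contribution
    for i in range(1, n):
        # Spin until producer i has published its buffer.
        while flags[i].atomic_cas(1, 1) != 1:
            time.sleep(0)  # yield so producers can make progress
        total += buffers[i]

    for t in threads:
        t.join()
    return total
```

In this sketch the consumer adds the buffers in a fixed order, so no atomic add on a shared accumulator is needed. The scheme relies on the flag write becoming visible only after the buffer write, which is the ordering that breaks down when the two atomics are not synchronized consistently.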

PR 4431 causes a data race when atomic_xchg and atomic_cas are used together to implement the [spin lock](https://github.com/ROCm/triton/blob/624335ff569562d5db26bea337e3c6de2bd6b0dc/python/perf-kernels/streamk/streamk_kernel.py#L173C12-L205C1): atomic_cas uses shared memory but atomic_xchg doesn't.

In Triton, atomic operations are performed at the block level, and each block can consist of multiple waves. The synchronization is added to ensure that the other waves in the block wait until the current wave has completed its execution.