streamk atomics fix

The stream-K GEMM kernel uses a spin lock to implement a multiple-buffer method that replaces atomic_add.

PR 4431 causes a data race when atomic_xchg and atomic_cas are used together to implement the [spin lock](https://github.com/ROCm/triton/blob/624335ff569562d5db26bea337e3c6de2bd6b0dc/python/perf-kernels/streamk/streamk_kernel.py#L173C12-L205C1): atomic_cas uses shared memory, but atomic_xchg doesn't.

In Triton, atomic operations are performed at the block level, and each block can consist of multiple waves. The added synchronization ensures that the other waves in a block wait until the current wave has finished executing.
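For readers unfamiliar with the pattern, here is a minimal CPU-side sketch of a flag-based spin lock combined with per-worker buffers. It is an analogy only, not the kernel code: `AtomicInt`, `reduce_tile`, and the use of Python threads are hypothetical stand-ins, and a mutex emulates the atomicity that the GPU provides in hardware. Each producer writes its partial result to its own buffer and then publishes a flag with an exchange; the consumer spins on a compare-and-swap until the flag is set, then reads the buffer.

```python
import threading
import time


class AtomicInt:
    """Hypothetical stand-in for a lock word; a mutex emulates atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._mutex = threading.Lock()

    def cas(self, expected, new):
        """Store `new` only if the value equals `expected`; return the old value."""
        with self._mutex:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def xchg(self, new):
        """Unconditionally store `new`; return the old value."""
        with self._mutex:
            old = self._value
            self._value = new
            return old


def reduce_tile(partials_per_worker):
    """Sum per-worker partial results using private buffers and ready flags."""
    n = len(partials_per_worker)
    buffers = [None] * n                      # one private buffer per worker
    flags = [AtomicInt(0) for _ in range(n)]  # 0 = not ready, 1 = ready
    result = []

    def producer(i):
        buffers[i] = sum(partials_per_worker[i])  # write the partial result first
        flags[i].xchg(1)                          # then publish the ready flag

    def consumer():
        total = 0
        for i in range(n):
            # Spin until producer i has published; cas(1, 1) leaves the flag unchanged.
            while flags[i].cas(1, 1) != 1:
                time.sleep(0)  # yield to other threads while waiting
            total += buffers[i]
        result.append(total)

    threads = [threading.Thread(target=producer, args=(i,)) for i in range(n)]
    threads.append(threading.Thread(target=consumer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(reduce_tile([[1, 2], [3, 4], [5]]))
```

The ordering is the point of the sketch: the buffer write must be visible before the flag is observed as set. If the flag update and the flag check do not go through the same memory path, the consumer can read a stale buffer, which is the kind of race this PR addresses.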

ROCm / triton

streamk atomics fix #632