Closed: QiZhangNV closed this issue 1 week ago
Thanks, I think you're right. We might have fixed it in the decode branch, which we're working on merging to main: https://github.com/Dao-AILab/flash-attention/blob/decode/hopper/epilogue_fwd_sm90_tma.hpp
That’s great, it looks good to me.
Hello, I'd like to report a potential hazard that occurs in the epilogue when `kUseVarSeqLen=true`. Under this condition, it uses `write_tiled()` instead of `write_tma()` to write `O` to gmem. This implies that all threads are responsible for issuing a copy from smem to gmem. However, it appears that these threads are not synchronized prior to the copy operation.

To consistently reproduce this bug, insert the following code before the `cute::copy(rmem_tiled_copy_O, taccOrO, taccOsO)` at https://github.com/Dao-AILab/flash-attention/blob/main/hopper/epilogue_fwd_sm90_tma.hpp#L200. This makes the last warp's copy from rmem to smem much slower than the others. Due to the lack of synchronization, data is copied to gmem before the smem data is fully written.
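For readers who want to see the hazard in isolation, here is a minimal standalone sketch. It is not the reproduction snippet from this report and it does not use the flash-attention or CuTe code. The kernel `staged_store`, the helper `count_stale`, the block size, and the delay length are all hypothetical. Each thread stages a value in shared memory (standing in for the rmem to smem copy), and then a thread in a different warp writes that value to global memory (standing in for the smem to gmem copy). The last warp is delayed on purpose, so without a barrier the warp that reads its slots will likely see values that have not been written yet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads  = 128;  // hypothetical block size: 4 warps
constexpr int kWarpSize = 32;

// Toy stand-in for the epilogue: stage to smem, then store to gmem.
// kSync selects whether a block-wide barrier separates the two phases.
template <bool kSync>
__global__ void staged_store(int* out) {
    __shared__ int smem[kThreads];
    const int tid = threadIdx.x;

    // Clear the staging buffer so a premature read shows up as 0.
    smem[tid] = 0;
    __syncthreads();

    // Artificially slow down the last warp before it writes to smem.
    if (tid / kWarpSize == kThreads / kWarpSize - 1) {
        const long long start = clock64();
        while (clock64() - start < 1000000LL) { }
    }

    // Phase 1: "rmem -> smem" (each thread writes its own slot).
    smem[tid] = tid + 1;

    // Without this barrier, phase 2 races with phase 1 of other warps.
    if (kSync) { __syncthreads(); }

    // Phase 2: "smem -> gmem" (each thread reads a slot owned by the next warp).
    const int src = (tid + kWarpSize) % kThreads;
    out[tid] = smem[src];
}

// Launch one block and count outputs that differ from the staged value.
// Error checking is omitted to keep the sketch short.
template <bool kSync>
int count_stale() {
    int* d_out = nullptr;
    int h_out[kThreads];
    cudaMalloc(&d_out, sizeof(h_out));
    staged_store<kSync><<<1, kThreads>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int stale = 0;
    for (int i = 0; i < kThreads; ++i) {
        const int expected = (i + kWarpSize) % kThreads + 1;
        if (h_out[i] != expected) ++stale;
    }
    return stale;
}

int main() {
    printf("without barrier: %d stale values\n", count_stale<false>());
    printf("with barrier:    %d stale values\n", count_stale<true>());
    return 0;
}
```

With the barrier, every thread should read the staged value. Without it, the result is a data race, so the exact count of stale values is not guaranteed, but the warp that reads from the delayed warp would typically observe zeros. The real epilogue may use a different synchronization primitive than `__syncthreads()`; the point of the sketch is only that some barrier must sit between the smem writes and the cross-thread smem reads.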