Closed scxiao closed 5 months ago
The current FA decode forward kernel can only be configured with 1 wave per workgroup. This PR adds support for multiple waves per workgroup, which is expected to improve performance.