Currently, in the TMA circular buffering pass, we generate code like below:

```
// main loop
for i:
  if (elect_sync) {
    arrive-load;
    TMA;
  }
  wait-load;
  compute;
  __syncthreads(); // to avoid WAR hazard
}
```
This PR adds an option to change the generated code into:
```
// main loop
for i:
  if (elect_sync) {
    wait-compute;
    arrive-load;
    TMA;
  }
  wait-load;
  compute;
  arrive-compute;
}
```
That is, the plain old `__syncthreads()` is replaced with an arrive/wait on an mbarrier. With this change, each circular buffer stage uses two mbarriers: one signaling that the corresponding data has been loaded to smem and is ready to read (RAW hazard), and another signaling that the corresponding data has been fully read and will not be read again, so the space is free to be reused for new data (WAR hazard).
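To make this concrete, here is a minimal hand-written sketch of the two-mbarrier scheme. Everything in it is illustrative: `mbarrier_wait`, `mbarrier_arrive`, `mbarrier_arrive_expect_tx`, and `tma_load` are hypothetical wrappers around the corresponding PTX mbarrier/TMA instructions, `NUM_STAGES` is an assumed stage count, and the phase-parity bookkeeping is simplified. This is not the code the pass emits.

```cuda
// Illustrative only: hypothetical wrappers around PTX mbarrier/TMA ops.
__shared__ uint64_t raw_bar[NUM_STAGES]; // RAW: data loaded, safe to read
__shared__ uint64_t war_bar[NUM_STAGES]; // WAR: data read, safe to overwrite

for (int i = 0; i < num_iters; ++i) {
  int stage = i % NUM_STAGES;
  // An mbarrier's phase flips every time it completes, i.e. once per
  // trip around the circular buffer.
  int phase = (i / NUM_STAGES) % 2;
  if (elect_sync()) {
    // wait-compute (WAR): block until every warp group arrived on
    // war_bar[stage] last round, so nobody still reads this buffer.
    // (Prologue elided: the first NUM_STAGES iterations skip this wait.)
    mbarrier_wait(war_bar[stage], phase);
    // arrive-load (RAW): declare how many bytes the TMA will deposit;
    // the mbarrier completes once all of them have arrived.
    mbarrier_arrive_expect_tx(raw_bar[stage], tma_bytes);
    // TMA: async bulk copy into this stage's smem buffer.
    tma_load(smem_buf[stage], gmem_src + i * tile_bytes, raw_bar[stage]);
  }
  // wait-load (RAW): all warp groups wait for this stage's data.
  mbarrier_wait(raw_bar[stage], phase);
  compute(stage);
  // arrive-compute (WAR): this warp group is done reading the buffer.
  mbarrier_arrive(war_bar[stage]);
}
```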
In theory, I expected better performance with this feature enabled, because `__syncthreads()` is a hard sync that requires all warp groups to reach the same point, while the arrive/wait barrier is a much softer sync that only requires all warp groups to have passed an earlier point. But unfortunately, the perf is worse for my matmul kernel, which is why I made this feature default-off.
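For intuition only (this is not the code the pass generates), the difference between the two sync styles can be shown with libcu++'s `cuda::barrier`, whose split arrive/wait exposes the same softness as the mbarrier scheme:

```cuda
#include <cuda/barrier>

__global__ void sync_styles() {
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x); // expect one arrival per thread
  }
  __syncthreads();

  // Hard sync: every thread stalls here until ALL threads reach this line.
  __syncthreads();

  // Soft sync: announce progress now, keep doing independent work, and
  // only block at wait() until the others' earlier arrive() calls land.
  auto token = bar.arrive();
  // ... independent work that does not touch the protected buffer ...
  bar.wait(std::move(token));
}
```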
The main purpose of this PR is not to support a slower way to wait. The reason I did this work is that this scheme of two mbarriers per stage, one for RAW and another for WAR, is very close to the warp specialization code we want to generate, so the work in this PR can be largely reused by warp specialization. The main purpose of this PR is therefore to serve as an incremental step towards warp specialization; giving people a second option to try is only a side benefit.
Besides, the above code is a very good illustration of why we should go for warp specialization. On Hopper, both TMA and MMA are async; however, putting load and compute into the same warp makes it impossible to truly pipeline the two. From the above code we can easily see that each MMA not only needs to wait for its data to be ready (RAW), but also needs to wait for the buffer of the next load to be freed. Why? Just because the load sits a few lines above it in the code. The MMA of this iteration has no real dependency on that buffer being freed, because the MMA of this iteration will never touch it; the dependency is fake, created purely by program order. What if the data for this iteration's MMA is ready, but the buffer for this iteration's load is not freed yet? Can I start the MMA while waiting for that buffer to be freed? No, I cannot, because load and compute live in the same warp, so both must be ready before the warp can move on with this iteration. With this observation, it is very natural to separate loading and computation into different warps. A sketch of that structure follows below.
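To sketch where this is heading (again with the hypothetical wrappers from above, and not something this PR generates): with warp specialization, the producer warp keeps issuing TMAs whenever a buffer is free, while the MMA warps only ever wait on RAW, so the fake dependency disappears:

```cuda
// Hypothetical warp-specialized structure; producer and consumers
// communicate only through the two mbarrier arrays.
if (is_producer_warp()) {
  for (int i = 0; i < num_iters; ++i) {
    int stage = i % NUM_STAGES;
    int phase = (i / NUM_STAGES) % 2;
    mbarrier_wait(war_bar[stage], phase);   // buffer free to overwrite?
    mbarrier_arrive_expect_tx(raw_bar[stage], tma_bytes);
    tma_load(smem_buf[stage], gmem_src + i * tile_bytes, raw_bar[stage]);
    // No wait-load here: the producer runs ahead as far as free
    // buffers allow, keeping TMA busy.
  }
} else {
  for (int i = 0; i < num_iters; ++i) {
    int stage = i % NUM_STAGES;
    int phase = (i / NUM_STAGES) % 2;
    mbarrier_wait(raw_bar[stage], phase);   // only RAW: is my data ready?
    mma(stage);                             // never stalls on a WAR wait
    mbarrier_arrive(war_bar[stage]);        // hand the buffer back
  }
}
```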