Closed liqiangxl closed 1 day ago
!test
Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?
I already have a fix at https://github.com/NVIDIA/Fuser/pull/3438. If that looks reasonable, we don't need this WAR.
It may take some time to review that PR, so let's get this merged for now.
!test
DistributedTransformerTest.MultiheadAttention_SP/__half fails on main
!test
!test
When total_reduction_numel <= 1024, the scheduler may use multiple reductions per block with bdimy > 1. This leads to a race condition in shared memory when using async copy. Adding cp.async.wait_all after the first async copy avoids the race, but we need to figure out the root cause before we can safely rely on it. So, as a WAR, this PR sets bdimy = 1. It should be reverted after the fix in #3438 is merged. Race detected with: