use bdimy = 1 to WAR smem race

liqiangxl commented 5 days ago

when total_reduction_numel <= 1024, scheduler may use multiple reductions per block with bdimy > 1, this leads to race condition in shared memory when using async copy. Adding cp.async.wait_allafter the 1st async copy can avoid the race, but needs to figure out the root cause before we can safely use it. So, here we set bdimy = 1 as a WAR. Should be reverted after the fix in #3438 is merged. race detected with:

NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info  ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'

liqiangxl commented 5 days ago

!test

naoyam commented 2 days ago

Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

liqiangxl commented 2 days ago

Is this WAR still a draft? I know you're working on a proper fix, but since it's a silent error, could you please prioritize landing this WAR first?

I already have a fix at https://github.com/NVIDIA/Fuser/pull/3438, if that looks reasonable, we don't need this WAR.

naoyam commented 2 days ago

It may take some time to review that PR, so let's get this merged for now.

liqiangxl commented 2 days ago

!test

liqiangxl commented 2 days ago

DistributedTransformerTest.MultiheadAttention_SP/__half fails at main

liqiangxl commented 1 day ago

!test

liqiangxl commented 1 day ago

!test

NVIDIA / Fuser

use bdimy = 1 to WAR smem race #3423