🐞 Describe the Bug

Enabling sequence-tensor-parallel produces NaNs.

For example, with mistral-7b, stp2 and debug_tensor_parallel=true, NaNs are reported right away, at layer 0 fw (job 11922721-9e3a-4c31-a16d-62d1e74cac3a):

2024-11-21 00:09:48,767 [Rank 0] After initial setup:  allocated 21,552.57 MiB | max allocated 21,552.57 MiB | reserved 21,558.00 MiB | max reserved 21,558.00 MiB | global max reserved 21,558.00 MiB
2024-11-21 00:09:48,767 [Rank 0] Initializing Training data iterator from sample 0...
2024-11-21 00:09:49,272 [Rank 0] Training ...
2024-11-21 00:09:49,915 [Rank 0] running build_ext
2024-11-21 00:09:49,926 [Rank 4] running build_ext
2024-11-21 00:09:49,937 [Rank 7] running build_ext
2024-11-21 00:09:49,938 [Rank 6] running build_ext
2024-11-21 00:09:50,014 [Rank 3] running build_ext
2024-11-21 00:09:50,014 [Rank 2] running build_ext
2024-11-21 00:09:50,022 [Rank 5] running build_ext
2024-11-21 00:09:50,094 [Rank 1] running build_ext
2024-11-21 00:10:00,178 [Rank 1] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,184 [Rank 0] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,198 [Rank 2] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,198 [Rank 3] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,403 [Rank 5] MISMATCH layer 0 fw 31,330,694 / 33,554,432 [31,330,694 nans detected locally]
2024-11-21 00:10:00,404 [Rank 4] MISMATCH layer 0 fw 31,330,694 / 33,554,432 [31,330,694 nans detected locally]
2024-11-21 00:10:00,419 [Rank 6] MISMATCH layer 0 fw 31,314,358 / 33,554,432 [31,314,358 nans detected locally]
2024-11-21 00:10:00,420 [Rank 7] MISMATCH layer 0 fw 31,314,358 / 33,554,432 [31,314,358 nans detected locally]

On the other hand, the same config with sequence_tensor_parallel=false works fine (job d61fd9fb-00b6-4b66-bc2a-f7c62d8fcc20).
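
To make the MISMATCH lines in the failing log easier to compare, here is a minimal sketch of a log summarizer. It is not part of Fast-LLM: the helper names `parse_mismatches` and `group_ranks_by_nan_count` are hypothetical, and the regex only assumes the line shape shown in the log above.

```python
import re
from collections import defaultdict

# Matches lines shaped like the MISMATCH entries in the log above.
# The field names are illustrative, not Fast-LLM terminology.
MISMATCH_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] MISMATCH layer (?P<layer>\d+) (?P<phase>\w+) "
    r"(?P<mismatched>[\d,]+) / (?P<total>[\d,]+) "
    r"\[(?P<nans>[\d,]+) nans detected locally\]"
)


def _to_int(text):
    """Convert a comma-grouped number such as '33,554,432' to an int."""
    return int(text.replace(",", ""))


def parse_mismatches(log_text):
    """Return one dict per MISMATCH line found in log_text."""
    records = []
    for match in MISMATCH_RE.finditer(log_text):
        fields = match.groupdict()
        records.append({
            "rank": int(fields["rank"]),
            "layer": int(fields["layer"]),
            "phase": fields["phase"],
            "mismatched": _to_int(fields["mismatched"]),
            "total": _to_int(fields["total"]),
            "nans": _to_int(fields["nans"]),
        })
    return records


def group_ranks_by_nan_count(records):
    """Map each distinct NaN count to the sorted list of ranks reporting it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["nans"]].append(record["rank"])
    return {count: sorted(ranks) for count, ranks in groups.items()}


# Four lines copied from the log above.
sample = """\
2024-11-21 00:10:00,178 [Rank 1] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,184 [Rank 0] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,198 [Rank 2] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,198 [Rank 3] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
"""

records = parse_mismatches(sample)
for record in sorted(records, key=lambda r: r["rank"]):
    fraction = record["nans"] / record["total"]
    print(f"rank {record['rank']}: layer {record['layer']} {record['phase']}, "
          f"{fraction:.1%} of elements are NaN")
print(group_ranks_by_nan_count(records))
```

In the log above, the mismatched count equals the local NaN count on every rank, and the counts come in identical pairs (ranks 0/1, 2/3, 4/5, 6/7), which would be consistent with a tensor-parallel group size of 2.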

🔄 Steps to Reproduce

Run anything with sequence_tensor_parallel=true and tensor_parallel>1.

🎯 Expected Behavior

No NaNs.

ServiceNow / Fast-LLM

[bug] Nans and/or desync for sequence-tensor-parallel. #59
