Open jlamypoirier opened 1 day ago
🐞 Describe the Bug
Using sequence-tensor-parallel leads to NaNs.
For example, with mistral-7b, stp2 and debug_tensor_parallel=true, NaNs appear right away, at layer 0 (job 11922721-9e3a-4c31-a16d-62d1e74cac3a):
2024-11-21 00:09:48,767 [Rank 0] After initial setup: allocated 21,552.57 MiB | max allocated 21,552.57 MiB | reserved 21,558.00 MiB | max reserved 21,558.00 MiB | global max reserved 21,558.00 MiB
2024-11-21 00:09:48,767 [Rank 0] Initializing Training data iterator from sample 0...
2024-11-21 00:09:49,272 [Rank 0] Training ...
2024-11-21 00:09:49,915 [Rank 0] running build_ext
2024-11-21 00:09:49,926 [Rank 4] running build_ext
2024-11-21 00:09:49,937 [Rank 7] running build_ext
2024-11-21 00:09:49,938 [Rank 6] running build_ext
2024-11-21 00:09:50,014 [Rank 3] running build_ext
2024-11-21 00:09:50,014 [Rank 2] running build_ext
2024-11-21 00:09:50,022 [Rank 5] running build_ext
2024-11-21 00:09:50,094 [Rank 1] running build_ext
2024-11-21 00:10:00,178 [Rank 1] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,184 [Rank 0] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,198 [Rank 2] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,198 [Rank 3] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,403 [Rank 5] MISMATCH layer 0 fw 31,330,694 / 33,554,432 [31,330,694 nans detected locally]
2024-11-21 00:10:00,404 [Rank 4] MISMATCH layer 0 fw 31,330,694 / 33,554,432 [31,330,694 nans detected locally]
2024-11-21 00:10:00,419 [Rank 6] MISMATCH layer 0 fw 31,314,358 / 33,554,432 [31,314,358 nans detected locally]
2024-11-21 00:10:00,420 [Rank 7] MISMATCH layer 0 fw 31,314,358 / 33,554,432 [31,314,358 nans detected locally]
On the other hand, the same config with sequence_tensor_parallel=false works fine (job d61fd9fb-00b6-4b66-bc2a-f7c62d8fcc20).
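To make the MISMATCH lines in the log easier to read, here is a minimal sketch that parses them and groups ranks by their local NaN count. The regular expression is inferred only from the lines quoted in this report, and the helper names (`parse_mismatches`, `group_by_nan_count`) are hypothetical, not part of the training code. In the quoted log, every rank reports roughly 93% of the 33,554,432 compared elements as NaN, and ranks report identical counts in pairs, which would be consistent with tensor-parallel groups of size 2 (an assumption, not confirmed by the report).

```python
import re
from collections import defaultdict

# Pattern inferred from the MISMATCH lines quoted in this issue (assumption:
# other debug output from the trainer may use a different layout).
MISMATCH_RE = re.compile(
    r"\[Rank (\d+)\] MISMATCH layer (\d+) (\w+) "
    r"([\d,]+) / ([\d,]+) \[([\d,]+) nans detected locally\]"
)

# MISMATCH entries copied from the log in this report.
LOG = """
2024-11-21 00:10:00,178 [Rank 1] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,184 [Rank 0] MISMATCH layer 0 fw 31,297,822 / 33,554,432 [31,297,822 nans detected locally]
2024-11-21 00:10:00,198 [Rank 2] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,198 [Rank 3] MISMATCH layer 0 fw 31,157,872 / 33,554,432 [31,157,872 nans detected locally]
2024-11-21 00:10:00,403 [Rank 5] MISMATCH layer 0 fw 31,330,694 / 33,554,432 [31,330,694 nans detected locally]
2024-11-21 00:10:00,404 [Rank 4] MISMATCH layer 0 fw 31,330,694 / 33,554,432 [31,330,694 nans detected locally]
2024-11-21 00:10:00,419 [Rank 6] MISMATCH layer 0 fw 31,314,358 / 33,554,432 [31,314,358 nans detected locally]
2024-11-21 00:10:00,420 [Rank 7] MISMATCH layer 0 fw 31,314,358 / 33,554,432 [31,314,358 nans detected locally]
"""


def parse_mismatches(log_text):
    """Return {rank: (mismatched, total, nans)}.

    Works on one-entry-per-line logs and on logs flattened into a single line.
    If a rank appears more than once, the last entry wins.
    """
    result = {}
    for match in MISMATCH_RE.finditer(log_text):
        rank = int(match.group(1))
        mismatched, total, nans = (
            int(match.group(i).replace(",", "")) for i in (4, 5, 6)
        )
        result[rank] = (mismatched, total, nans)
    return result


def group_by_nan_count(mismatches):
    """Group ranks that report the same local NaN count."""
    groups = defaultdict(list)
    for rank, (_, _, nans) in sorted(mismatches.items()):
        groups[nans].append(rank)
    return dict(groups)


if __name__ == "__main__":
    parsed = parse_mismatches(LOG)
    for rank, (mismatched, total, nans) in sorted(parsed.items()):
        print(f"rank {rank}: {nans / total:.1%} NaN, {mismatched - nans} non-NaN mismatches")
    print(group_by_nan_count(parsed))
```

In the quoted log the mismatch count equals the NaN count on every rank, so all reported mismatches are NaNs rather than finite values that differ between ranks.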
🔄 Steps to Reproduce
Run any configuration with sequence_tensor_parallel=true and tensor_parallel>1.
🎯 Expected Behavior
No NaNs.