Closed: XLzed closed this issue 1 week ago
I hit the same timeout issue. It comes from the NCCL default timeout value not being set effectively when NeMo calls Megatron: with the NCCL backend, the default timeout is changed to 10 minutes, which is why the error reports Timeout(ms)=600000
(600000 milliseconds = 10 minutes).
You need code changes to pass a larger timeout value (e.g. 30 minutes), as in https://github.com/NVIDIA/Megatron-LM/commit/e69187bc3679ea5841030a165d587bb48b56ee77, and also where NeMo calls parallel_state.initialize_model_parallel: https://github.com/NVIDIA/NeMo/blob/v1.23.0/nemo/collections/nlp/parts/nlp_overrides.py#L125
Alternatively, upgrade to a version of Megatron that includes the commit above.
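For illustration, here is a minimal sketch (not the actual NeMo/Megatron patch) of raising the NCCL timeout from the 10-minute default. The `timeout` argument to `torch.distributed.init_process_group` is standard PyTorch; passing a timeout through to `parallel_state.initialize_model_parallel` is assumed to be possible only with the Megatron-LM commit linked above, and the exact keyword may differ, so it is left commented out.

```python
# Sketch: raise the NCCL collective timeout from the 10-minute default
# (Timeout(ms)=600000 in the error) to 30 minutes before model-parallel init.
from datetime import timedelta

import torch.distributed as dist
from megatron.core import parallel_state

NCCL_TIMEOUT = timedelta(minutes=30)  # instead of the 10-minute default

if not dist.is_initialized():
    # Standard PyTorch API: the timeout applies to NCCL collectives on this group.
    dist.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)

# Parallel sizes taken from one of the failing test cases (tp=2, pp=4, cp=8).
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=4,
    context_parallel_size=8,
    # timeout=NCCL_TIMEOUT,  # hypothetical: only valid if the installed
    # Megatron-core includes the timeout plumbing from the commit above.
)
```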
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
Context parallel does not work in some cases, such as pretraining llama-34b on 64 A800 GPUs with seqlen >= 32768, while using Megatron-LM directly with the same config works fine. I want to use NeMo's SFT support such as sequence packing, so I hope this can be fixed soon.
Environment details
Test cases:
![image](https://github.com/NVIDIA/NeMo/assets/46588381/0921b1bc-0a02-4574-adce-cb7afa4b7b48)
Error message details:
2. NCCL timeout (tp=2, pp=4, cp=8)
3. NCCL timeout (tp=8, pp=1, cp=8)
pretrain_llama34b_config.yaml