Open wuhouming opened 7 months ago
hello, did you solve it?
Could you change to use GLOO as backend? Does it work?
hello, I met the same problem. Have you solved it?
Could you switch to the GLOO backend (for both P2P and collectives) and see whether it works? We haven't fixed NCCL with torchrun yet.
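For anyone else hitting this: switching backends comes down to the `backend` argument of `init_process_group`. A minimal single-process sketch, assuming torchrun-style environment variables (set manually here for a world of one; the port number is arbitrary):

```python
import os
import torch.distributed as dist

# torchrun normally exports these; set them here for a 1-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Use GLOO instead of NCCL for the default group, so both collectives
# and P2P ops go through GLOO.
dist.init_process_group(backend="gloo")
backend_ok = dist.get_backend() == "gloo"
dist.destroy_process_group()
```

Under torchrun the environment variables are already set, so only the `backend="gloo"` argument needs to change.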
hello, did you solve it?
I didn't solve the hang. But I suspect it is caused by simultaneous sends and receives between ranks. Batched P2P communication ops, as in Megatron-LM, could solve it. Code like this:
```python
def send_forward_recv_backward(output_tensor: torch.Tensor,
                               tensor_shape: Shape,
                               config: ModelParallelConfig) -> torch.Tensor:
    """Batched send and recv with next rank in pipeline.

    See _communicate for argument details.
    """
    if core.parallel_state.is_pipeline_last_stage():
        output_tensor_grad = None
    else:
        if config.timers is not None:
            config.timers('forward-send-backward-recv', log_level=2).start()
        _, output_tensor_grad, _ = _communicate(
            tensor_send_next=output_tensor,
            tensor_send_prev=None,
            recv_prev=False,
            recv_next=True,
            tensor_shape=tensor_shape,
            config=config)
        if config.timers is not None:
            config.timers('forward-send-backward-recv').stop()
    return output_tensor_grad
```
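Under the hood, this batching pattern corresponds to `torch.distributed.batch_isend_irecv`: the send and the matching receive are posted as one group before waiting on any of them, so two neighboring ranks that send to each other simultaneously cannot deadlock on ordering. A standalone sketch of the same idea (the function name, tensor shapes, and `next_rank` argument are illustrative, not Megatron-LM's API):

```python
import torch
import torch.distributed as dist

def send_forward_recv_backward_batched(output_tensor: torch.Tensor,
                                       next_rank: int) -> torch.Tensor:
    """Send activations to the next pipeline stage and receive its
    gradient, posting both ops as one batch to avoid a send/recv
    ordering deadlock between ranks."""
    output_tensor_grad = torch.empty_like(output_tensor)
    ops = [
        dist.P2POp(dist.isend, output_tensor, next_rank),
        dist.P2POp(dist.irecv, output_tensor_grad, next_rank),
    ]
    # All ops are posted before any wait, so the peer's symmetric
    # send/recv pair can make progress regardless of ordering.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return output_tensor_grad
```

This requires an initialized process group with at least two ranks; with plain blocking `dist.send`/`dist.recv`, two ranks that both send first would block forever.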
> Could you change to GLOO backend (for both P2P and collective) to see whether it works or not? We didn't fix NCCL with torchrun yet.
It works! Thanks a lot!
I tried to change the slurm script (i.e., prof_steps.sh) to torchrun and ran it directly, but hit a hang with NCCL as the collective_backend. The torchrun script is as follows:
When I choose 'gpipe' or '1f1b' as the pipeline method, it works normally. However, selecting 'interleave' results in a loss of 0, while 'chimera' causes the program to hang and then raise a timeout error.