NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Barrier error with PyTorch 2.1.1 when using DeepSpeed with the torch NCCL backend #1132

Open jon-chuang opened 8 months ago

jon-chuang commented 8 months ago
01/02/2024 00:20:14 - INFO - accelerate.accelerator - Saving current state to path/to/saved/model/checkpoint-490
01/02/2024 00:20:14 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
Traceback (most recent call last):
  File "/home/jonch/Desktop/Programming/consulting/sudoai/../../mlsys/diffusers/examples/consistency_distillation/train_lcm_distill_lora_sd_wds.py", line 1344, in <module>
    main(args)
  File "/home/jonch/Desktop/Programming/consulting/sudoai/../../mlsys/diffusers/examples/consistency_distillation/train_lcm_distill_lora_sd_wds.py", line 1318, in main
    accelerator.save_state(save_path)
  File "/home/jonch/Desktop/Programming/mlsys/accelerate/src/accelerate/accelerator.py", line 2711, in save_state
    model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
  File "/home/jonch/Desktop/Programming/pytorch-stable/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3049, in save_checkpoint
    dist.barrier()
  File "/home/jonch/Desktop/Programming/pytorch-stable/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/jonch/Desktop/Programming/pytorch-stable/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
    return cdb.barrier(group=group, async_op=async_op)
  File "/home/jonch/Desktop/Programming/pytorch-stable/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 312, in barrier
    return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/jonch/Desktop/Programming/pytorch-stable/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/jonch/Desktop/Programming/pytorch-stable/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3698, in barrier
    work = group.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.1
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7fe3f4039520

Naive-Bayes commented 1 month ago

I am hitting the same problem. This may not be caused by NCCL but by DeepSpeed. When training on multiple GPUs with the traditional DDP paradigm, we usually save the checkpoint only on rank 0. But when we use ZeRO, the optimizer state is partitioned across ranks (ZeRO stage 1), or the gradients as well (ZeRO stage 2), so every rank needs to take part in saving, as in the sketch below.
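
A minimal sketch of the failure mode described above, assuming `model_engine` is an engine returned by `deepspeed.initialize()` and `torch.distributed` is already initialized; `save()`, `save_dir`, and `tag` are illustrative names, not part of the original report:

```python
import torch.distributed as dist

def save(model_engine, save_dir, tag):
    # WRONG with ZeRO: guarding the save behind rank 0, as is common in plain
    # DDP training, means only rank 0 enters save_checkpoint(). The method
    # calls dist.barrier() internally, so the other ranks never reach the
    # matching barrier and the collective eventually fails or hangs.
    #
    # if dist.get_rank() == 0:
    #     model_engine.save_checkpoint(save_dir, tag)

    # Correct: every rank calls save_checkpoint(), so each rank writes its own
    # ZeRO partition (optimizer state in stage 1, gradients and optimizer
    # state in stage 2) and all ranks meet at the internal barrier.
    model_engine.save_checkpoint(save_dir, tag)
```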