Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

NCCL error when trying to train on 2 nodes #19544

Open waynemystir opened 4 months ago

waynemystir commented 4 months ago

Bug description

I am trying to run a very simple training script on 2 nodes, and I always get this error:

Output:

```
(ve) root@442a8ba5c0c6:~/ptl# . wr.sh
Start fitting...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function getCvarBool)
442a8ba5c0c6:126284:126284 [0] NCCL INFO cudaDriverVersion 12030
442a8ba5c0c6:126284:126284 [0] NCCL INFO Bootstrap : Using eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126284 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
442a8ba5c0c6:126284:126284 [0] NCCL INFO init.cc:1627 Cuda Host Alloc Size 4 pointer 0x7f1756e00000
442a8ba5c0c6:126284:126391 [0] NCCL INFO Failed to open libibverbs.so[.1]
442a8ba5c0c6:126284:126391 [0] NCCL INFO NET/Socket : Using [0]eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using non-device net plugin version 0
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using network Socket

442a8ba5c0c6:126284:126391 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:565 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:619 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO bootstrap.cc:274 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO init.cc:1388 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:418 -> 2
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:95 -> 2
[rank1]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

Traceback (most recent call last):
  File "/root/ptl/wrain.py", line 68, in <module>
    main()
  File "/root/ptl/wrain.py", line 62, in main
    trainer.fit(model, train_dataloaders=train_data)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
    self.__setup_profiler()
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1078, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1248, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#5 c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#6 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
#7 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
#8 std::decay<c10::guts::infer_function_traits<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> > >::type::return_type>::type c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>(c10::OperatorKernel*, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>*) from :0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#10 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#11 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
#12 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
#13 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
#14 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyObject_CallFunctionObjArgs from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 PyMethod_New from ??:0
#19 _PyEval_EvalFrameDefault from ??:0
#20 _PyFunction_Vectorcall from ??:0
#21 PyObject_Call from ??:0
#22 _PyEval_EvalFrameDefault from ??:0
#23 _PyFunction_Vectorcall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyFunction_Vectorcall from ??:0
#26 PyObject_Call from ??:0
#27 _PyEval_EvalFrameDefault from ??:0
#28 _PyFunction_Vectorcall from ??:0
#29 _PyEval_EvalFrameDefault from ??:0
#30 _PyFunction_Vectorcall from ??:0
#31 _PyEval_EvalFrameDefault from ??:0
#32 PyObject_IsTrue from ??:0
#33 _PyObject_GenericGetAttrWithDict from ??:0
#34 PyObject_GetAttr from ??:0
#35 _PyEval_EvalFrameDefault from ??:0
#36 _PyFunction_Vectorcall from ??:0
#37 _PyEval_EvalFrameDefault from ??:0
#38 PyMethod_New from ??:0
#39 _PyEval_EvalFrameDefault from ??:0
#40 PyMethod_New from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyMethod_New from ??:0
#43 PyObject_Call from ??:0
#44 _PyEval_EvalFrameDefault from ??:0
#45 _PyFunction_Vectorcall from ??:0
#46 _PyEval_EvalFrameDefault from ??:0
#47 PyMethod_New from ??:0
#48 _PyEval_EvalFrameDefault from ??:0
#49 _PyFunction_Vectorcall from ??:0
#50 _PyEval_EvalFrameDefault from ??:0
#51 PyEval_EvalCode from ??:0
#52 PyEval_EvalCode from ??:0
#53 PyUnicode_Tailmatch from ??:0
#54 PyInit__collections from ??:0
#55 PyUnicode_Tailmatch from ??:0
#56 _PyRun_SimpleFileObject from ??:0
#57 _PyRun_AnyFileObject from ??:0
#58 Py_RunMain from ??:0
#59 Py_BytesMain from ??:0
#60 __libc_init_first from ??:0
#61 __libc_start_main from ??:0
#62 _start from ??:0
```
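Note that rank 1 bootstraps on eth0 (172.18.0.2) and the failing connect targets 192.168.240.2; both look like container/bridge-network addresses rather than anything routable between the two machines. If the failure comes from NCCL picking an interface the other node cannot reach, pinning the interface is one thing to try (an untested sketch; `eth0` is just the name taken from the log above):

```bash
# Untested sketch: force NCCL's socket transport (and its bootstrap) onto one interface.
# Replace "eth0" with an interface that is actually reachable from the other node.
export NCCL_SOCKET_IFNAME=eth0
# Gloo honors a similar variable, in case a CPU/Gloo process group is also created.
export GLOO_SOCKET_IFNAME=eth0
```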

What version are you seeing the problem on?

v2.2

How to reproduce the bug

I'm trying to run a very simple script on 2 nodes. This is the launch environment on node 1 (wr.sh):

```bash
export MASTER_ADDR=194.26.196.214
export MASTER_PORT=31421
export WORLD_SIZE=2
export NODE_RANK=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1

export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
python train.py
```
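The other node is presumably launched with the same script and only a different NODE_RANK (an assumption; both nodes must agree on MASTER_ADDR, MASTER_PORT and WORLD_SIZE). Something like:

```bash
# Presumed launch on the first node: same settings as above, only NODE_RANK differs.
export MASTER_ADDR=194.26.196.214
export MASTER_PORT=31421
export WORLD_SIZE=2
export NODE_RANK=0   # 1 on the second node
python train.py
```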

Where train.py is this:

```python
import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        print("batch_idx=", batch_idx)
        loss = self(batch).sum()
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

def main():
    train_data = DataLoader(RandomDataset(32, 64000), batch_size=8000)
    model = BoringModel()
    trainer = Trainer(
        devices=1,
        num_nodes=2,
        accelerator="gpu",
        strategy="ddp",
        limit_train_batches=100000,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        log_every_n_steps=1,
    )
    print("Start fitting...")
    trainer.fit(model, train_dataloaders=train_data)

    print("DONE")

if __name__ == "__main__":
    main()
```
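A quick sanity check is to dump the rendezvous-related variables on both machines right before launching and compare them: MASTER_ADDR, MASTER_PORT and WORLD_SIZE should match, while NODE_RANK should differ. For example:

```bash
# Run on each node just before "python train.py" and compare the two outputs.
env | grep -E '^(MASTER_ADDR|MASTER_PORT|WORLD_SIZE|NODE_RANK|NCCL_)' | sort
```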

Error messages and logs


Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

p208p2002 commented 4 months ago

Did you set a different NODE_RANK on each node? I currently run multi-node training with Lightning v2.2.0 + DeepSpeed on Azure's GPU cluster successfully, without manually setting any env variables (maybe they are set by the cluster system).