alibaba / Pai-Megatron-Patch

The official repository of Pai-Megatron-Patch for large-scale LLM & VLM training, developed by Alibaba Cloud.
Apache License 2.0

With seq_length=256, qwen2.5-72b, micro_batch_size=2, global_batch_size=64 on four H20 nodes, training errors out; with seq_length=2048 the error disappears #380

Closed. yangzhipeng1108 closed this issue 6 days ago.

yangzhipeng1108 commented 1 week ago

52022: [rank31]: Traceback (most recent call last):
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/examples/qwen2_5/../qwen2/pretrain_qwen.py", line 276, in <module>
52022: [rank31]:     pretrain(
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 326, in pretrain
52022: [rank31]:     iteration, num_floating_point_operations_so_far = train(
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 1247, in train
52022: [rank31]:     loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = train_step(
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 688, in train_step
52022: [rank31]:     losses_reduced = forward_backward_func(
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/pipeline_parallel/schedules.py", line 1370, in forward_backward_pipelining_without_interleaving
52022: [rank31]:     input_tensor = recv_forward(recv_tensor_shapes, config)
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/pipeline_parallel/schedules.py", line 1152, in recv_forward
52022: [rank31]:     input_tensors.append(p2p_communication.recv_forward(tensor_shape, config))
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/pipeline_parallel/p2p_communication.py", line 387, in recv_forward
52022: [rank31]:     input_tensor, _, _ = _communicate(
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/pipeline_parallel/p2p_communication.py", line 355, in _communicate
52022: [rank31]:     reqs = p2p_func(
52022: [rank31]:   File "/workspace3/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/pipeline_parallel/p2p_communication.py", line 163, in _batched_p2p_ops
52022: [rank31]:     reqs = torch.distributed.batch_isend_irecv(ops)
52022: [rank31]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2159, in batch_isend_irecv
52022: [rank31]:     p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
52022: [rank31]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1913, in irecv
52022: [rank31]:     return pg.recv([tensor], group_src_rank, tag)
52022: [rank31]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
52022: [rank31]: Exception raised from doWait at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/TCPStore.cpp:559 (most recent call first):
52022: [rank31]: frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa789ef64b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
52022: [rank31]: frame #1: <unknown function> + 0x1041e32 (0x7fa77d8dce32 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #2: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2a (0x7fa7819ea9fa in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #3: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd4 (0x7fa7819eb704 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #4: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f (0x7fa781996daf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f (0x7fa781996daf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f (0x7fa781996daf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #7: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f (0x7fa781996daf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #8: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f (0x7fa781996daf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x16d (0x7fa738a9f33d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
52022: [rank31]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Device&, c10d::OpType, int, bool) + 0x156c (0x7fa738aab0ec in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
52022: [rank31]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x7ee (0x7fa738ac6e8e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
52022: [rank31]: frame #12: <unknown function> + 0x50e94f5 (0x7fa7819844f5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #13: <unknown function> + 0x50f9dea (0x7fa781994dea in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #14: <unknown function> + 0x47bc20b (0x7fa78105720b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #15: <unknown function> + 0x51039ae (0x7fa78199e9ae in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #16: <unknown function> + 0x5107d85 (0x7fa7819a2d85 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
52022: [rank31]: frame #17: <unknown function> + 0xce6510 (0x7fa7894ee510 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
52022: [rank31]: frame #18: <unknown function> + 0x44b9c7 (0x7fa788c539c7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
52022: [rank31]: frame #19: <unknown function> + 0x15adae (0x55833f277dae in /usr/bin/python)
52022: [rank31]: frame #20: _PyObject_MakeTpCall + 0x25b (0x55833f26e52b in /usr/bin/python)
52022: [rank31]: frame #21: <unknown function> + 0x16952b (0x55833f28652b in /usr/bin/python)
52022: [rank31]: frame #22: _PyEval_EvalFrameDefault + 0x64e2 (0x55833f266742 in /usr/bin/python)
52022: [rank31]: frame #23: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #24: _PyEval_EvalFrameDefault + 0x64e2 (0x55833f266742 in /usr/bin/python)
52022: [rank31]: frame #25: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #26: _PyEval_EvalFrameDefault + 0x64e2 (0x55833f266742 in /usr/bin/python)
52022: [rank31]: frame #27: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #28: _PyEval_EvalFrameDefault + 0x19b6 (0x55833f261c16 in /usr/bin/python)
52022: [rank31]: frame #29: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #30: _PyEval_EvalFrameDefault + 0x19b6 (0x55833f261c16 in /usr/bin/python)
52022: [rank31]: frame #31: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #32: _PyEval_EvalFrameDefault + 0x64e2 (0x55833f266742 in /usr/bin/python)
52022: [rank31]: frame #33: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #34: _PyEval_EvalFrameDefault + 0x6d5 (0x55833f260935 in /usr/bin/python)
52022: [rank31]: frame #35: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #36: _PyEval_EvalFrameDefault + 0x19b6 (0x55833f261c16 in /usr/bin/python)
52022: [rank31]: frame #37: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #38: _PyEval_EvalFrameDefault + 0x6d5 (0x55833f260935 in /usr/bin/python)
52022: [rank31]: frame #39: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #40: _PyEval_EvalFrameDefault + 0x6d5 (0x55833f260935 in /usr/bin/python)
52022: [rank31]: frame #41: _PyFunction_Vectorcall + 0x7c (0x55833f2786ac in /usr/bin/python)
52022: [rank31]: frame #42: _PyEval_EvalFrameDefault + 0x19b6 (0x55833f261c16 in /usr/bin/python)
52022: [rank31]: frame #43: <unknown function> + 0x140096 (0x55833f25d096 in /usr/bin/python)
52022: [rank31]: frame #44: PyEval_EvalCode + 0x86 (0x55833f352f66 in /usr/bin/python)
52022: [rank31]: frame #45: <unknown function> + 0x260e98 (0x55833f37de98 in /usr/bin/python)
52022: [rank31]: frame #46: <unknown function> + 0x25a79b (0x55833f37779b in /usr/bin/python)
52022: [rank31]: frame #47: <unknown function> + 0x260be5 (0x55833f37dbe5 in /usr/bin/python)
52022: [rank31]: frame #48: _PyRun_SimpleFileObject + 0x1a8 (0x55833f37d0c8 in /usr/bin/python)
52022: [rank31]: frame #49: _PyRun_AnyFileObject + 0x43 (0x55833f37cd13 in /usr/bin/python)
52022: [rank31]: frame #50: Py_RunMain + 0x2be (0x55833f36f70e in /usr/bin/python)
52022: [rank31]: frame #51: Py_BytesMain + 0x2d (0x55833f345dfd in /usr/bin/python)
52022: [rank31]: frame #52: <unknown function> + 0x29d90 (0x7fa795904d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
52022: [rank31]: frame #53: __libc_start_main + 0x80 (0x7fa795904e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
52022: [rank31]: frame #54: _start + 0x25 (0x55833f345cf5 in /usr/bin/python)
52022: [rank31]: . This may indicate a possible application crash on rank 0 or a network set up issue.
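
For readers of the traceback: the failure surfaces in torch.distributed.batch_isend_irecv, which Megatron-Core's pipeline schedule uses for point-to-point activation transfers between pipeline stages, and the Socket Timeout is hit while the NCCL communicator for that P2P pair is still being bootstrapped through the c10d TCPStore. Below is a minimal standalone sketch (not taken from this repo; the torchrun launch line, world size, and tensor shape are illustrative assumptions) that exercises the same batch_isend_irecv pattern, which can help separate a rendezvous/network problem from a training-code problem:

```python
# Minimal sketch (not from Pai-Megatron-Patch): exercises the same
# torch.distributed.batch_isend_irecv path that fails in the traceback above.
# Example launch (hostnames/ports are placeholders):
#   torchrun --nnodes=4 --nproc_per_node=8 --rdzv_backend=c10d \
#            --rdzv_endpoint=<master_ip>:29500 p2p_check.py
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A generous timeout makes a slow communicator bootstrap show up as
    # slowness rather than a hard "Socket Timeout" crash.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))

    # Ring-style P2P: each rank sends to the next rank and receives from the
    # previous one, mimicking pipeline-parallel activation exchange.
    # Shape (2, 256, 8192) loosely mirrors micro_batch_size=2, seq_length=256
    # and a large hidden size; it is only illustrative.
    send_buf = torch.full((2, 256, 8192), float(rank),
                          device="cuda", dtype=torch.bfloat16)
    recv_buf = torch.empty_like(send_buf)
    next_rank = (rank + 1) % world_size
    prev_rank = (rank - 1) % world_size

    ops = [
        dist.P2POp(dist.isend, send_buf, next_rank),
        dist.P2POp(dist.irecv, recv_buf, prev_rank),
    ]
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    torch.cuda.synchronize()

    if rank == 0:
        print("batch_isend_irecv completed on all ranks")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If even this sketch times out across the four H20 nodes, the cause is more likely the cluster's network/rendezvous setup than the sequence-length-dependent code path in the training script.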

lostkevin commented 1 week ago

You can take a look at the latest PR first; it should have fixed this problem.
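
For anyone who cannot pick up that fix right away, one hedged mitigation for this class of Socket Timeout is simply to give all ranks more time to finish the ncclUniqueId exchange before P2P send/recv is attempted. A sketch of such a defensive initialization follows; the environment variables and the 30-minute value are assumptions rather than project defaults, and recent Megatron-LM versions also expose a --distributed-timeout-minutes argument that serves the same purpose (worth verifying against PAI-Megatron-LM-240718):

```python
# Sketch only: a more forgiving torch.distributed initialization while
# waiting to pick up the upstream fix. Megatron's own initialize_megatron()
# normally performs this step; this is not the project's code.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Surface NCCL bootstrap problems instead of silent stalls.
# These environment variables are assumptions about the cluster setup.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# A longer timeout gives all 32 ranks time to finish exchanging the
# ncclUniqueId through the TCPStore before recv/send is attempted.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
```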

yangzhipeng1108 commented 6 days ago

Thanks!