OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 2 * 18 * 7 #346

Closed: hehebamei closed this issue 3 days ago

hehebamei commented 4 days ago

I switched from a conda environment to Docker, but this error still occurs. What should I do? Is this error caused by an environment library?

hijkzzz commented 4 days ago

How many GPUs did you use?

hehebamei commented 4 days ago

> How many GPUs did you use?

Our lab generally runs with 7 GPUs, and I made no changes to the settings files. Our 8th graphics card was taken out of service due to instability; does this matter?
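(For readers reproducing this: with the 8th card out of service, world_size becomes 7, which is why the default batch size no longer divides evenly. A minimal sketch of how one might pin a run to the 7 healthy cards; the device indices are an assumption, since the issue does not say which card was disabled.)

# Hedged sketch: expose only the 7 healthy GPUs to the process before torch/DeepSpeed initialize.
# The indices below are illustrative, not taken from this issue.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6"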

hijkzzz commented 4 days ago

Try setting the batch size to a multiple of 7, such as 126.
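For context, the assertion in the issue title is DeepSpeed's consistency check: the global train batch size must equal micro_batch_per_gpu * gradient_accumulation_steps * world_size. A minimal sketch of that arithmetic, with illustrative variable names rather than the actual DeepSpeed source:

# Rough sketch of the batch-size consistency check DeepSpeed enforces (illustrative, not its source code).
def check_batch_config(train_batch_size, micro_batch_per_gpu, grad_acc_steps, world_size):
    expected = micro_batch_per_gpu * grad_acc_steps * world_size
    assert train_batch_size == expected, (
        f"train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size "
        f"{train_batch_size} != {micro_batch_per_gpu} * {grad_acc_steps} * {world_size}"
    )

# The reported failure is 256 != 2 * 18 * 7 (= 252); with 7 GPUs, 126 = 2 * 9 * 7 satisfies the check.
check_batch_config(126, micro_batch_per_gpu=2, grad_acc_steps=9, world_size=7)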

hehebamei commented 4 days ago

@hijkzzz It works when I set the batch size to a multiple of 14, but another issue occurs. How can I solve it? Thank you very much for your help.

rank2: Traceback (most recent call last):
rank2:   File "/project/OpenRLHF/examples/scripts/../train_sft.py", line 180, in <module>
rank2:   File "/project/OpenRLHF/examples/scripts/../train_sft.py", line 89, in train
rank2:     (model, optim, scheduler) = strategy.prepare((model, optim, scheduler))
rank2:   File "/root/.local/lib/python3.10/site-packages/openrlhf/utils/deepspeed.py", line 157, in prepare
rank2:   File "/root/.local/lib/python3.10/site-packages/openrlhf/utils/deepspeed.py", line 167, in _ds_init_train_model
rank2:     engine, optim, _, scheduler = deepspeed.initialize(
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
rank2:     engine = DeepSpeedEngine(args=args,
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
rank2:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
rank2:     return func(*args, **kwargs)
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
rank2:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
rank2:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
rank2:     return fn(*args, **kwargs)
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
rank2:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
rank2:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
rank2:     return func(*args, **kwargs)
rank2:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2140, in broadcast
rank2:     work = group.broadcast([tensor], opts)
rank2: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer
rank2: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
rank2: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x733aaa77a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
rank2: frame #1: + 0x5b3a23e (0x733a96c8f23e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x733a96c89c87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x733a96c89f82 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x733a96c8afd1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x733a96c3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x733a96c3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x733a96c3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x733a96c3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x733a5e44f6d9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
rank2: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x733a5e456b60 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
rank2: frame #11: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator >&, c10d::BroadcastOptions const&) + 0x5a8 (0x733a5e465748 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
rank2: frame #12: + 0x5addde6 (0x733a96c32de6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #13: + 0x5ae6cd3 (0x733a96c3bcd3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #14: + 0x5ae9b39 (0x733a96c3eb39 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #15: + 0x5124446 (0x733a96279446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #16: + 0x1acf4b8 (0x733a92c244b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #17: + 0x5aee23a (0x733a96c4323a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #18: + 0x5afe26c (0x733a96c5326c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank2: frame #19: + 0xd244b5 (0x733aa97f34b5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
rank2: frame #20: + 0x47de04 (0x733aa8f4ce04 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
rank2: frame #21: + 0x15a10e (0x6041c3a1a10e in /usr/bin/python)
rank2: frame #22: _PyObject_MakeTpCall + 0x25b (0x6041c3a10a7b in /usr/bin/python)
rank2: frame #23: + 0x168acb (0x6041c3a28acb in /usr/bin/python)
rank2: frame #24: _PyEval_EvalFrameDefault + 0x614a (0x6041c3a08cfa in /usr/bin/python)
rank2: frame #25: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #26: PyObject_Call + 0x122 (0x6041c3a29492 in /usr/bin/python)
rank2: frame #27: _PyEval_EvalFrameDefault + 0x2a27 (0x6041c3a055d7 in /usr/bin/python)
rank2: frame #28: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #29: _PyEval_EvalFrameDefault + 0x198c (0x6041c3a0453c in /usr/bin/python)
rank2: frame #30: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #31: PyObject_Call + 0x122 (0x6041c3a29492 in /usr/bin/python)
rank2: frame #32: _PyEval_EvalFrameDefault + 0x2a27 (0x6041c3a055d7 in /usr/bin/python)
rank2: frame #33: + 0x1687f1 (0x6041c3a287f1 in /usr/bin/python)
rank2: frame #34: _PyEval_EvalFrameDefault + 0x198c (0x6041c3a0453c in /usr/bin/python)
rank2: frame #35: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #36: PyObject_Call + 0x122 (0x6041c3a29492 in /usr/bin/python)
rank2: frame #37: _PyEval_EvalFrameDefault + 0x2a27 (0x6041c3a055d7 in /usr/bin/python)
rank2: frame #38: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #39: _PyEval_EvalFrameDefault + 0x198c (0x6041c3a0453c in /usr/bin/python)
rank2: frame #40: + 0x1687f1 (0x6041c3a287f1 in /usr/bin/python)
rank2: frame #41: _PyEval_EvalFrameDefault + 0x614a (0x6041c3a08cfa in /usr/bin/python)
rank2: frame #42: + 0x1687f1 (0x6041c3a287f1 in /usr/bin/python)
rank2: frame #43: _PyEval_EvalFrameDefault + 0x614a (0x6041c3a08cfa in /usr/bin/python)
rank2: frame #44: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #45: _PyObject_FastCallDictTstate + 0x16d (0x6041c3a0fcbd in /usr/bin/python)
rank2: frame #46: + 0x164a64 (0x6041c3a24a64 in /usr/bin/python)
rank2: frame #47: _PyObject_MakeTpCall + 0x1fc (0x6041c3a10a1c in /usr/bin/python)
rank2: frame #48: _PyEval_EvalFrameDefault + 0x75a0 (0x6041c3a0a150 in /usr/bin/python)
rank2: frame #49: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #50: _PyEval_EvalFrameDefault + 0x198c (0x6041c3a0453c in /usr/bin/python)
rank2: frame #51: + 0x16893e (0x6041c3a2893e in /usr/bin/python)
rank2: frame #52: _PyEval_EvalFrameDefault + 0x2a27 (0x6041c3a055d7 in /usr/bin/python)
rank2: frame #53: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #54: _PyEval_EvalFrameDefault + 0x8ac (0x6041c3a0345c in /usr/bin/python)
rank2: frame #55: _PyFunction_Vectorcall + 0x7c (0x6041c3a1a9fc in /usr/bin/python)
rank2: frame #56: _PyEval_EvalFrameDefault + 0x6bd (0x6041c3a0326d in /usr/bin/python)
rank2: frame #57: + 0x13f9c6 (0x6041c39ff9c6 in /usr/bin/python)
rank2: frame #58: PyEval_EvalCode + 0x86 (0x6041c3af5256 in /usr/bin/python)
rank2: frame #59: + 0x260108 (0x6041c3b20108 in /usr/bin/python)
rank2: frame #60: + 0x2599cb (0x6041c3b199cb in /usr/bin/python)
rank2: frame #61: + 0x25fe55 (0x6041c3b1fe55 in /usr/bin/python)
rank2: frame #62: _PyRun_SimpleFileObject + 0x1a8 (0x6041c3b1f338 in /usr/bin/python)
rank2: frame #63: _PyRun_AnyFileObject + 0x43 (0x6041c3b1ef83 in /usr/bin/python)
rank2: . This may indicate a possible application crash on rank 0 or a network set up issue.

rank6: Traceback (most recent call last):
rank6:   File "/project/OpenRLHF/examples/scripts/../train_sft.py", line 180, in <module>
rank6:   File "/project/OpenRLHF/examples/scripts/../train_sft.py", line 89, in train
rank6:     (model, optim, scheduler) = strategy.prepare((model, optim, scheduler))
rank6:   File "/root/.local/lib/python3.10/site-packages/openrlhf/utils/deepspeed.py", line 157, in prepare
rank6:   File "/root/.local/lib/python3.10/site-packages/openrlhf/utils/deepspeed.py", line 167, in _ds_init_train_model
rank6:     engine, optim, _, scheduler = deepspeed.initialize(
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
rank6:     engine = DeepSpeedEngine(args=args,
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
rank6:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
rank6:     return func(*args, **kwargs)
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
rank6:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
rank6:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
rank6:     return fn(*args, **kwargs)
rank6:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
rank6:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
rank6:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
rank6:     return func(*args, **kwargs)
rank6:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2140, in broadcast
rank6:     work = group.broadcast([tensor], opts)
rank6: torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer
rank6: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
rank6: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e7562492897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
rank6: frame #1: + 0x5b3a23e (0x7e754ea8f23e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7e754ea89c87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7e754ea89f82 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7e754ea8afd1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7e754ea3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7e754ea3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7e754ea3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7e754ea3f371 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7e751624f6d9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
rank6: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7e7516256b60 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
rank6: frame #11: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator >&, c10d::BroadcastOptions const&) + 0x5a8 (0x7e7516265748 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
rank6: frame #12: + 0x5addde6 (0x7e754ea32de6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #13: + 0x5ae6cd3 (0x7e754ea3bcd3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #14: + 0x5ae9b39 (0x7e754ea3eb39 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #15: + 0x5124446 (0x7e754e079446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #16: + 0x1acf4b8 (0x7e754aa244b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #17: + 0x5aee23a (0x7e754ea4323a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #18: + 0x5afe26c (0x7e754ea5326c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
rank6: frame #19: + 0xd244b5 (0x7e75615f34b5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
rank6: frame #20: + 0x47de04 (0x7e7560d4ce04 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
rank6: frame #21: + 0x15a10e (0x649a8c1a810e in /usr/bin/python)
rank6: frame #22: _PyObject_MakeTpCall + 0x25b (0x649a8c19ea7b in /usr/bin/python)
rank6: frame #23: + 0x168acb (0x649a8c1b6acb in /usr/bin/python)
rank6: frame #24: _PyEval_EvalFrameDefault + 0x614a (0x649a8c196cfa in /usr/bin/python)
rank6: frame #25: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #26: PyObject_Call + 0x122 (0x649a8c1b7492 in /usr/bin/python)
rank6: frame #27: _PyEval_EvalFrameDefault + 0x2a27 (0x649a8c1935d7 in /usr/bin/python)
rank6: frame #28: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #29: _PyEval_EvalFrameDefault + 0x198c (0x649a8c19253c in /usr/bin/python)
rank6: frame #30: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #31: PyObject_Call + 0x122 (0x649a8c1b7492 in /usr/bin/python)
rank6: frame #32: _PyEval_EvalFrameDefault + 0x2a27 (0x649a8c1935d7 in /usr/bin/python)
rank6: frame #33: + 0x1687f1 (0x649a8c1b67f1 in /usr/bin/python)
rank6: frame #34: _PyEval_EvalFrameDefault + 0x198c (0x649a8c19253c in /usr/bin/python)
rank6: frame #35: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #36: PyObject_Call + 0x122 (0x649a8c1b7492 in /usr/bin/python)
rank6: frame #37: _PyEval_EvalFrameDefault + 0x2a27 (0x649a8c1935d7 in /usr/bin/python)
rank6: frame #38: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #39: _PyEval_EvalFrameDefault + 0x198c (0x649a8c19253c in /usr/bin/python)
rank6: frame #40: + 0x1687f1 (0x649a8c1b67f1 in /usr/bin/python)
rank6: frame #41: _PyEval_EvalFrameDefault + 0x614a (0x649a8c196cfa in /usr/bin/python)
rank6: frame #42: + 0x1687f1 (0x649a8c1b67f1 in /usr/bin/python)
rank6: frame #43: _PyEval_EvalFrameDefault + 0x614a (0x649a8c196cfa in /usr/bin/python)
rank6: frame #44: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #45: _PyObject_FastCallDictTstate + 0x16d (0x649a8c19dcbd in /usr/bin/python)
rank6: frame #46: + 0x164a64 (0x649a8c1b2a64 in /usr/bin/python)
rank6: frame #47: _PyObject_MakeTpCall + 0x1fc (0x649a8c19ea1c in /usr/bin/python)
rank6: frame #48: _PyEval_EvalFrameDefault + 0x75a0 (0x649a8c198150 in /usr/bin/python)
rank6: frame #49: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #50: _PyEval_EvalFrameDefault + 0x198c (0x649a8c19253c in /usr/bin/python)
rank6: frame #51: + 0x16893e (0x649a8c1b693e in /usr/bin/python)
rank6: frame #52: _PyEval_EvalFrameDefault + 0x2a27 (0x649a8c1935d7 in /usr/bin/python)
rank6: frame #53: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #54: _PyEval_EvalFrameDefault + 0x8ac (0x649a8c19145c in /usr/bin/python)
rank6: frame #55: _PyFunction_Vectorcall + 0x7c (0x649a8c1a89fc in /usr/bin/python)
rank6: frame #56: _PyEval_EvalFrameDefault + 0x6bd (0x649a8c19126d in /usr/bin/python)
rank6: frame #57: + 0x13f9c6 (0x649a8c18d9c6 in /usr/bin/python)
rank6: frame #58: PyEval_EvalCode + 0x86 (0x649a8c283256 in /usr/bin/python)
rank6: frame #59: + 0x260108 (0x649a8c2ae108 in /usr/bin/python)
rank6: frame #60: + 0x2599cb (0x649a8c2a79cb in /usr/bin/python)
rank6: frame #61: + 0x25fe55 (0x649a8c2ade55 in /usr/bin/python)
rank6: frame #62: _PyRun_SimpleFileObject + 0x1a8 (0x649a8c2ad338 in /usr/bin/python)
rank6: frame #63: _PyRun_AnyFileObject + 0x43 (0x649a8c2acf83 in /usr/bin/python)
rank6: . This may indicate a possible application crash on rank 0 or a network set up issue.

[2024-07-05 08:03:23,041] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 21704
[2024-07-05 08:03:23,095] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 21705
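The key line in both tracebacks is the DistBackendError: each non-zero rank fails to fetch the ncclUniqueId from rank 0 through the c10d TCPStore ("Connection reset by peer"), which usually means rank 0 crashed during startup or the processes cannot reach each other. A hedged snippet for getting more information on the next run; these environment variables are standard NCCL/PyTorch debugging knobs, not something this issue prescribes:

# Hedged sketch: turn on verbose NCCL / c10d logging before deepspeed.initialize()
# so the next failure shows which rank dies first. These are standard NCCL/PyTorch
# settings, not OpenRLHF-specific flags.
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL init/transport logs
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")      # focus on init and networking
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d consistency checks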

hijkzzz commented 4 days ago

https://blog.csdn.net/qq874455953/article/details/134408257

hehebamei commented 4 days ago

> https://blog.csdn.net/qq874455953/article/details/134408257

@hijkzzz I had already tried the fix from this blog before you brought it up, but it does not work. I'm still trying to solve it... It's strange to hit this problem in a single-machine, multi-GPU environment.

hehebamei commented 3 days ago

I have fixed it by removing the .cache directory first.
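For anyone landing here later: the comment does not say which cache was removed. A plausible reading, stated as an assumption, is the per-user build cache under the home directory, e.g. ~/.cache/torch_extensions, where DeepSpeed JIT-builds its fused ops; a stale build there can break startup after switching from conda to Docker. A minimal sketch under that assumption:

# Hedged sketch: clear the per-user torch extension build cache before relaunching.
# The path is an assumption based on the "removing the .cache" remark above;
# ~/.cache/torch_extensions is the default location for DeepSpeed's JIT-built ops.
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "torch_extensions"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"removed {cache_dir}")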