How many GPUs did you use?
Our lab generally runs 7 GPUs, and I made no changes to the setting files. Our 8th graphics card was banned due to instability; does this matter?
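(Side note for readers with the same setup: if the banned card is still enumerated by the driver, it can help to mask it explicitly so the framework never touches it. A minimal sketch, assuming the unstable card is device index 7 — the index is an assumption, not stated in the issue:)

```python
# Hide the unstable 8th card so the process only ever sees 7 GPUs.
# Index 7 is an assumption; adjust to whichever card was banned.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6"

import torch  # import after setting the variable so CUDA enumerates correctly
print(torch.cuda.device_count())  # expected: 7
```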
Try setting the batch size to a multiple of 7, such as 126.
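For context: DeepSpeed requires the global batch size to factor as per-GPU micro batch × gradient accumulation steps × world size, which is why 7 GPUs need a multiple of 7. A minimal sketch of that constraint (the micro-batch and accumulation values below are assumptions, not the issue's actual config):

```python
# DeepSpeed requires: train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size.
world_size = 7            # the 7 usable GPUs
micro_batch_per_gpu = 2   # assumed value, not from the issue
grad_accum_steps = 9      # assumed value, not from the issue
train_batch_size = 126    # the multiple of 7 suggested above

assert train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size, \
    "DeepSpeed rejects configs where the global batch size does not factor this way"
```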
@hijkzzz Setting the batch size to a multiple of 14 works, but it triggers another issue. How can I solve it? Thank you very much for your help.
```
rank2: Traceback (most recent call last):
rank2:   File "/project/OpenRLHF/examples/scripts/../train_sft.py", line 180, in <module>
rank2:   File "/project/OpenRLHF/examples/scripts/../train_sft.py", line 89, in train
rank2:     (model, optim, scheduler) = strategy.prepare((model, optim, scheduler))
rank2:   File "/root/.local/lib/python3.10/site-packages/openrlhf/utils/deepspeed.py", line 157, in prepare
rank2:   File "/root/.local/lib/python3.10/site-packages/openrlhf/utils/deepspeed.py", line 167, in _ds_init_train_model
rank2:     engine, optim, _, scheduler = deepspeed.initialize(
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
rank2:     engine = DeepSpeedEngine(args=args,
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
rank2:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
rank2:     return func(*args, **kwargs)
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
rank2:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
rank2:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
rank2:     return fn(*args, **kwargs)
rank2:   File "/root/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
rank2:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
rank2:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
rank2:     return func(*args, **kwargs)
rank2:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2140, in broadcast
rank2:     work = group.broadcast([tensor], opts)
rank2: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer
rank2: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
rank2: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x733aaa77a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
rank2: frame #1:
```
rank6 fails with an identical traceback and the same `Connection reset by peer` error.
@hijkzzz I had already tried the fix from that blog before you mentioned it, but it doesn't work. I'm still trying to solve it... It's strange to hit this problem in a single-machine, multi-GPU environment.
I have fixed it by removing the .cache directory first.
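(For future readers: "the .cache" is not specified in the comment, but it most plausibly refers to the JIT-build caches under the home directory; DeepSpeed compiles fused ops into `~/.cache/torch_extensions`, and stale builds from a previous environment can break startup. A hedged sketch of the cleanup; the exact paths are assumptions:)

```python
# Remove stale JIT-compiled op caches; paths are the usual defaults (assumed).
import shutil
from pathlib import Path

for cache in (Path.home() / ".cache" / "torch_extensions",
              Path.home() / ".cache" / "torch"):
    if cache.exists():
        shutil.rmtree(cache)
        print(f"removed {cache}")
```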
I switched from the conda environment to Docker, but this error still occurs. What should I do? Is the error caused by a library in the environment?
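One way to check whether a library mismatch is to blame is to print the exact versions of the stack in both the conda and Docker environments and diff the output; a minimal sketch:

```python
# Print a version fingerprint of the stack; run in both environments and diff.
import torch
import deepspeed

print("torch     :", torch.__version__)
print("cuda      :", torch.version.cuda)
print("nccl      :", ".".join(map(str, torch.cuda.nccl.version())))
print("deepspeed :", deepspeed.__version__)
```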