allendred opened this issue 1 year ago
I have two machines and want to test multi-node, multi-GPU training with the large-chinese model, but training fails with the following error:
```
192.168.83.245: 595d69b310a0:48344:48344 [0] NCCL INFO Launch mode Parallel
192.168.83.245: 595d69b310a0:48345:48345 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f0753200000 recvbuff 0x7f0753200000 count 24 datatype 0 op 0 root 0 comm 0x7f0708002f70 [nranks=2] stream 0x842b850
192.168.83.245: 595d69b310a0:48345:48345 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x47b0a620
192.168.83.245: 595d69b310a0:48344:48344 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fc321200000 recvbuff 0x7fc321200000 count 24 datatype 0 op 0 root 0 comm 0x7fc2d4002f70 [nranks=2] stream 0x8dcb290
192.168.83.245: 595d69b310a0:48344:48344 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x475658e0
192.168.83.235: > number of parameters on model parallel rank 0: 178954240
192.168.83.235: > number of parameters on model parallel rank 1: 178954240
192.168.83.235: DeepSpeed is enabled.
192.168.83.235: [2023-05-15 10:22:08,734] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
192.168.83.235: 94e3c7d9b96c:36808:37224 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x7fd310000b20
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7fd304002f70
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7fd30400d180
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7fd561100200
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7fd30400d620
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7fd304059630
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7fd304059680
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7fd3040596a0
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7fd3040596c0
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7fd3040596e0
192.168.83.235: 94e3c7d9b96c:36808:37225 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7fd304059700
192.168.83.235: 94e3c7d9b96c:36808:37224 [0] NCCL INFO bootstrap.cc:121 Mem Alloc Size 56 pointer 0x7fd310008420
192.168.83.235: 94e3c7d9b96c:36808:37224 [0] NCCL INFO bootstrap.cc:122 Mem Alloc Size 56 pointer 0x7fd310008460
192.168.83.235: NCCL version 2.10.3+cuda11.3
192.168.83.235: 94e3c7d9b96c:36809:37226 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x7f7aa0000b20
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f7a98002f70
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f7a9800d180
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f7c0b100200
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f7a9800d620
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f7a98059630
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7f7a98059680
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7f7a980596a0
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7f7a980596c0
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7f7a980596e0
192.168.83.235: 94e3c7d9b96c:36809:37227 [1] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f7a98059700
192.168.83.235: 94e3c7d9b96c:36809:37226 [0] NCCL INFO bootstrap.cc:121 Mem Alloc Size 56 pointer 0x7f7aa0008420
192.168.83.235: 94e3c7d9b96c:36809:37226 [0] NCCL INFO bootstrap.cc:122 Mem Alloc Size 56 pointer 0x7f7aa0008460
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f0600002f70
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f060000d180
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f070d100200
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f060000d620
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f0600059630
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7f0600059680
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7f06000596a0
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7f06000596c0
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7f06000596e0
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f0600059700
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7fc164002f70
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7fc16400d180
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7fc2d9100200
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7fc16400d620
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7fc164059630
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7fc164059680
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7fc1640596a0
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7fc1640596c0
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7fc1640596e0
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7fc164059700
192.168.83.245:
192.168.83.245: 595d69b310a0:48344:48664 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.17.0.6<34147> failed : No route to host
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO bootstrap.cc:360 -> 2
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:501 -> 2
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO init.cc:904 -> 2
192.168.83.245: 595d69b310a0:48344:48664 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
192.168.83.245:
192.168.83.245: 595d69b310a0:48345:48663 [1] include/socket.h:409 NCCL WARN Net : Connect to 172.17.0.6<41918> failed : No route to host
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO bootstrap.cc:360 -> 2
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:501 -> 2
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO init.cc:904 -> 2
192.168.83.245: 595d69b310a0:48344:48344 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x4b4c9c50
192.168.83.245: 595d69b310a0:48345:48663 [1] NCCL INFO group.cc:72 -> 2 [Async thread]
192.168.83.245: 595d69b310a0:48345:48345 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x692de5e0
192.168.83.245: Traceback (most recent call last):
192.168.83.245:   File "/workspace/llm/glm/GLM/pretrain_glm.py", line 678, in <module>
192.168.83.245:     main()
192.168.83.245:   File "/workspace/llm/glm/GLM/pretrain_glm.py", line 582, in main
192.168.83.245:     model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
192.168.83.245:   File "/workspace/llm/glm/GLM/train_utils.py", line 254, in setup_model_and_optimizer
192.168.83.245:     model, optimizer, _, _ = deepspeed.initialize(
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
192.168.83.245:     engine = DeepSpeedEngine(args=args,
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 266, in __init__
192.168.83.245:     self._configure_distributed_model(model)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1073, in _configure_distributed_model
192.168.83.245:     self._broadcast_model()
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1003, in _broadcast_model
192.168.83.245:     dist.broadcast(p, groups._get_broadcast_src_rank(), group=self.data_parallel_group)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
192.168.83.245:     return func(*args, **kwargs)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
192.168.83.245:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 118, in broadcast
192.168.83.245:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1197, in broadcast
192.168.83.245:     work = group.broadcast([tensor], opts)
192.168.83.245: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
192.168.83.245: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
192.168.83.245: Traceback (most recent call last):
192.168.83.245:   File "/workspace/llm/glm/GLM/pretrain_glm.py", line 678, in <module>
192.168.83.245:     main()
192.168.83.245:   File "/workspace/llm/glm/GLM/pretrain_glm.py", line 582, in main
192.168.83.245:     model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
192.168.83.245:   File "/workspace/llm/glm/GLM/train_utils.py", line 254, in setup_model_and_optimizer
192.168.83.245:     model, optimizer, _, _ = deepspeed.initialize(
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
192.168.83.245:     engine = DeepSpeedEngine(args=args,
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 266, in __init__
192.168.83.245:     self._configure_distributed_model(model)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1073, in _configure_distributed_model
192.168.83.245:     self._broadcast_model()
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1003, in _broadcast_model
192.168.83.245:     dist.broadcast(p, groups._get_broadcast_src_rank(), group=self.data_parallel_group)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
192.168.83.245:     return func(*args, **kwargs)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
192.168.83.245:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 118, in broadcast
192.168.83.245:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
192.168.83.245:   File "/home/gnn/conda/envs/gnn/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1197, in broadcast
192.168.83.245:     work = group.broadcast([tensor], opts)
192.168.83.245: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
192.168.83.245: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
192.168.83.245: [2023-05-15 10:22:12,942] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 48344
192.168.83.245: [2023-05-15 10:22:12,943] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 48345
192.168.83.245: [2023-05-15 10:22:12,959] [ERROR] [launch.py:434:sigkill_handler] ['/home/gnn/conda/envs/gnn/bin/python', '-u', 'pretrain_glm.py', '--local_rank=1', '--block-lm', '--task-mask', '--bert-prob', '0.4', '--gap-sentence-prob', '0.3', '--avg-block-length', '3', '--gpt-min-ratio', '0.25', '--block-mask-prob', '0.1', '--short-seq-prob', '0.02', '--experiment-name', 'blocklm-large-chinese', '--model-parallel-size', '2', '--num-layers', '24', '--hidden-size', '1024', '--num-attention-heads', '16', '--seq-length', '512', '--max-position-embeddings', '1024', '--save', '/dataset/fd5061f6/english_data/checkpoints', '--log-interval', '50', '--eval-interval', '1000', '--save-interval', '2000', '--train-iters', '10000', '--train-data', 'wiki_ch', '--resume-dataloader', '--no-lazy-loader', '--tokenizer-type', 'ChineseSPTokenizer', '--fix-command-token', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--lr-decay-ratio', '0.1', '--lr-decay-iters', '200000', '--warmup', '0.04', '--checkpoint-activations', '--deepspeed-activation-checkpointing', '--fp16', '--deepspeed', '--deepspeed_config', '/workspace/llm/glm/GLM/config/config_block_large_chinese.json'] exits with return code = 1
pdsh@94e3c7d9b96c: 192.168.83.245: ssh exited with exit code 1
```
Same question here. Did you ever solve this?
You need to specify the network interface (NIC) in the DeepSpeed settings.
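The `Connect to 172.17.0.6 ... No route to host` warning suggests NCCL is advertising the Docker bridge address (172.17.0.x), which is not reachable from the other host, instead of the 192.168.83.x interface. As a reference only, here is a minimal sketch of pinning NCCL/Gloo to the host-facing NIC; `eth0` is a placeholder (not taken from this thread), so substitute whatever interface carries the 192.168.83.x address on each machine:

```bash
# Placeholder interface name -- check `ip addr` on each node and replace eth0
# with the interface that holds the 192.168.83.x address.
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO   # optional: keep verbose NCCL logs while debugging

# Depending on the DeepSpeed version, exports from the launching shell are not
# necessarily forwarded to the other node by the pdsh launcher. Listing the
# variables in ~/.deepspeed_env (one KEY=VALUE per line) on every node is the
# documented way to have them exported before each rank starts.
printf 'NCCL_SOCKET_IFNAME=eth0\nGLOO_SOCKET_IFNAME=eth0\n' > ~/.deepspeed_env
```

Since both ranks run inside containers (the hostnames look like Docker container IDs), another common workaround is to run the containers with `--network=host` so the host interfaces are visible directly.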
Oh, got it. It's solved now, thanks 🙏