Closed frankxyy closed 2 years ago
The two machines can SSH into each other without a password, although the shell type changes after connecting over SSH.
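First try specifying the same number of GPUs on each machine, then check whether the two machines can ping each other.
First check whether the two machines can ping each other, then check whether either machine has a proxy enabled. If so, turn it off first: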
unset http_proxy
unset https_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
It is also best for both machines to use the same number of GPUs, so use two GPUs on node 0 as well.
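The pre-flight checks suggested above could be wrapped in two small shell functions. This is only a sketch: the names `clear_proxy_vars` and `can_ping` are made up for illustration, and the `ping` options assume a Linux (iputils) `ping`.

```shell
# Hypothetical helpers; the names are illustrative, not part of OneFlow or NCCL.

# Unset the four proxy variables named in this thread.
# Must run in the same shell that later launches training.
clear_proxy_vars() {
  unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
}

# Return 0 if the peer answers one ICMP echo within 2 seconds.
# Assumes Linux ping flags (-c count, -W timeout in seconds).
can_ping() {
  ping -c 1 -W 2 "$1" > /dev/null 2>&1
}
```

For example, run `clear_proxy_vars` on both nodes, then `can_ping 172.27.231.79 || echo "node 0 unreachable"` from node 1.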
F20221031 10:52:42.193126 106986 eager_nccl_comm_manager.cpp:75] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
Check failure stack trace:
@ 0x7fded300c00a google::LogMessage::Fail()
@ 0x7fded300c2f2 google::LogMessage::SendToLog()
@ 0x7fded300bb77 google::LogMessage::Flush()
@ 0x7fded300e6e9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdecb89977d oneflow::(anonymous namespace)::CreateNcclComm()
@ 0x7fdecb89b401 oneflow::EagerNcclCommMgr::GetCommForDevice()
@ 0x7fdeccd3b220 oneflow::(anonymous namespace)::EagerNcclOpKernelCache::Init()
@ 0x7fdeccd2df7f oneflow::(anonymous namespace)::InitEagerNcclOpKernelCache()
@ 0x7fdecdc05fa0 oneflow::one::StatefulOpKernel::TryInitOpKernelStateAndCache()
@ 0x7fdec93acde5 oneflow::vm::OpCallInstructionType::Compute()
@ 0x7fdecc7f4c31 oneflow::vm::EventRecordedEpStreamType::Run()
@ 0x7fdecc7f8ab3 oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fdecc7faaf0 oneflow::(anonymous namespace)::WorkerLoop()
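I have verified that the machines can ping each other, and the proxies are all turned off. After changing each node to use the same number of GPUs, the error changed. It looks like NCCL initialization is failing?
Log with NCCL debug enabled: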
m5-autorl-test01:47324:48509 [0] NCCL INFO Bootstrap : Using bond0:172.27.231.79<0>
m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
m5-autorl-test01:47324:48509 [0] NCCL INFO Failed to open libibverbs.so[.1]
m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Socket : Using [0]bond0:172.27.231.79<0> [1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>
m5-autorl-test01:47324:48509 [0] NCCL INFO Using network Socket
NCCL version 2.12.10+cuda11.2
m5-autorl-test01:47325:48508 [1] NCCL INFO Bootstrap : Using bond0:172.27.231.79<0>
m5-autorl-test01:47325:48508 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
m5-autorl-test01:47325:48508 [1] NCCL INFO Failed to open libibverbs.so[.1]
m5-autorl-test01:47325:48508 [1] NCCL INFO NET/Socket : Using [0]bond0:172.27.231.79<0> [1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>
m5-autorl-test01:47325:48508 [1] NCCL INFO Using network Socket
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47325:48508 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/02 : 0 1 2 3
m5-autorl-test01:47325:48508 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/02 : 0 1 2 3
m5-autorl-test01:47324:48509 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 3[38000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO Channel 00/0 : 1[3e000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 3[38000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[3e000] via direct shared memory
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[3e000] via direct shared memory
m5-autorl-test01:47325:48508 [1] NCCL INFO Channel 01/0 : 1[3e000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Connected all rings
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 2[37000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 2[37000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 0[3d000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 0[3d000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47325:48698 [1] NCCL INFO include/net.h:25 -> 2
m5-autorl-test01:47325:48698 [1] NCCL INFO transport/net.cc:515 -> 2
m5-autorl-test01:47325:48698 [1] NCCL INFO proxy.cc:914 -> 2
m5-autorl-test01:47325:48698 [1] proxy.cc:1042 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2
m5-autorl-test01:47325:48508 [1] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer m5-autorl-test01<54529>
m5-autorl-test01:47325:48508 [1] NCCL INFO misc/socket.cc:531 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO misc/socket.cc:543 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO proxy.cc:805 -> 2
m5-autorl-test01:47325:48508 [1] proxy.cc:808 NCCL WARN Proxy Call to rank 1 failed (Connect)
m5-autorl-test01:47325:48508 [1] NCCL INFO transport/net.cc:269 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO transport.cc:127 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:730 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:915 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:951 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:964 -> 2
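In a log this long, the lines that matter most are the `NCCL WARN` entries and the `file:line -> 2` trail around them, where the code 2 matches the "unhandled system error (2)" in the earlier check failure. A minimal sketch for pulling those lines out of a saved log, reading from stdin; the function name `nccl_failures` is hypothetical:

```shell
# Print only NCCL warnings and the "file:line -> code" error-propagation lines.
# Reads the log from stdin; exits non-zero if nothing matches (grep behaviour).
nccl_failures() {
  grep -E 'NCCL WARN|NCCL INFO [A-Za-z_/]+\.(cc|h):[0-9]+ -> [0-9]+'
}
```

For example: `nccl_failures < nccl_debug.log`, where `nccl_debug.log` is an assumed file name for the captured output.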
It seems to work now. After I set NCCL_SOCKET_IFNAME=bond0 to restrict the interface, training started up.
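For reference, the fix above amounts to setting one NCCL environment variable before launching. A minimal sketch, assuming the interface name `bond0` from this thread; `run_with_nccl_iface` is a hypothetical wrapper, not a OneFlow or NCCL command, and the sysfs check assumes Linux:

```shell
# Run a command with NCCL restricted to one network interface.
# Usage: run_with_nccl_iface <iface> <command> [args...]
run_with_nccl_iface() {
  nccl_iface="$1"
  shift
  # Warn (but do not fail) if the interface is not present on this host.
  if [ ! -e "/sys/class/net/$nccl_iface" ]; then
    echo "warning: interface $nccl_iface not found on this host" >&2
  fi
  # The variables are set only for the launched command.
  NCCL_SOCKET_IFNAME="$nccl_iface" NCCL_DEBUG=INFO "$@"
}
```

On node 1 this might look like `run_with_nccl_iface bond0 env NODE=2 NODE_RANK=1 ADDR=172.27.231.79 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 2`, with the matching command on node 0.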
OK. If it is resolved, you can close this issue.
One more question: can different nodes use different numbers of GPUs?
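> It seems to work now. After I set NCCL_SOCKET_IFNAME=bond0 to restrict the interface, training started up.
How was this problem diagnosed, and why does this fix work?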
See this issue for reference: https://github.com/NVIDIA/nccl/issues/697
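One plausible reading of the debug log, offered as an interpretation rather than a confirmed diagnosis: the `NET/Socket : Using` line shows NCCL picking up two interfaces, `bond0` and a virtual `veth8739016`, and the second is likely not reachable from the other machine, which would explain why pinning `NCCL_SOCKET_IFNAME=bond0` helps. The sketch below lists the interfaces NCCL reports in a saved log; `nccl_socket_ifaces` is a hypothetical helper name, and `grep -o` is assumed to be available (GNU or BSD grep).

```shell
# List the distinct interface names from "NET/Socket : Using [n]iface:addr<port>" lines.
# Reads an NCCL_DEBUG=INFO log from stdin.
nccl_socket_ifaces() {
  grep 'NET/Socket : Using' \
    | grep -oE '\[[0-9]+\][^[:space:]:]+' \
    | sed -E 's/^\[[0-9]+\]//' \
    | sort -u
}
```

If more than one name is printed and one of them is a container or virtual interface, restricting NCCL to the physical one is a reasonable first thing to try.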
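> One more question: can different nodes use different numbers of GPUs?
Not yet. Multi-node launch currently assumes that the number of GPUs per node is the same on every machine. Asymmetric setups could be supported in the future, and we could even consider supporting heterogeneous clusters (different GPU models on different machines).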
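Because the launcher assumes the same GPU count on every node, a mismatch such as the 8 and 2 in this issue's launch commands can be caught before starting. A minimal sketch; `same_gpu_count` is a hypothetical helper that takes the per-node counts as arguments:

```shell
# Return 0 if every argument (GPUs per node) is identical, 1 otherwise.
same_gpu_count() {
  first_count="${1:-}"
  for node_count in "$@"; do
    if [ "$node_count" != "$first_count" ]; then
      echo "GPU count mismatch: $node_count vs $first_count" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `same_gpu_count 8 2 || echo "use the same GPU count on both nodes"`.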
Single-node training works fine. When running on multiple nodes (two machines), node 0 starts normally, but node 1 fails with an error at startup.
Launch commands:
node 0: NODE=2 NODE_RANK=0 ADDR=172.27.231.79 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
node 1: cuda_visible_devices=2,4 NODE=2 NODE_RANK=1 ADDR=172.27.231.79 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 2
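A side note on the node 1 command as quoted: environment variable names are case-sensitive on Linux, and CUDA reads `CUDA_VISIBLE_DEVICES` in upper case, so the lower-case `cuda_visible_devices=2,4` would not restrict the visible GPUs unless `tools/train.sh` handles it itself (that script is not shown in this thread). The sketch below counts the devices in such a list so the trailing GPU-count argument can be kept in sync; `count_visible_gpus` is a hypothetical helper.

```shell
# Count the entries in a comma-separated device list such as "2,4".
# Prints 0 for an empty or missing argument.
count_visible_gpus() {
  if [ -z "${1:-}" ]; then
    echo 0
    return 0
  fi
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}
```

A launch on node 1 could then read `CUDA_VISIBLE_DEVICES=2,4 NODE=2 NODE_RANK=1 ADDR=172.27.231.79 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py "$(count_visible_gpus 2,4)"`.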