NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Gloo connectFullMesh failed when the number of nodes setting "export GLOO_SOCKET_IFNAME=bond4" exceeds 60 #877

Open Genlovy-Hoo opened 1 week ago

Genlovy-Hoo commented 1 week ago

When we train models on a multi-node cluster, training fails with "RuntimeError: Gloo connectFullMesh failed ..." once the number of nodes that set "export GLOO_SOCKET_IFNAME=bond4" exceeds 60 (for example, 64). Training works when the number of such bond4 nodes is 60 or fewer.
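
For anyone trying to reproduce this, here is a minimal sketch of a per-node sanity check that could be run before launching training. It only confirms that the interface named by GLOO_SOCKET_IFNAME exists on the local node; it does not diagnose the connectFullMesh failure itself. The helper name `gloo_iface_status` is hypothetical, and the sketch assumes a single interface name, as in our setup.

```python
import os
import socket


def gloo_iface_status(env=None, interfaces=None):
    """Return (name, present) for the interface named by GLOO_SOCKET_IFNAME.

    `env` and `interfaces` are injectable so the check can be exercised
    without touching the real environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    if interfaces is None:
        # socket.if_nameindex() lists (index, name) pairs for local interfaces.
        interfaces = [name for _, name in socket.if_nameindex()]
    name = env.get("GLOO_SOCKET_IFNAME")
    if name is None:
        # Variable not exported on this node.
        return None, False
    return name, name in interfaces


if __name__ == "__main__":
    name, present = gloo_iface_status()
    print(f"{socket.gethostname()}: GLOO_SOCKET_IFNAME={name!r} present={present}")
```

Running it on every node (for example through the same launcher used for training) would show whether any of the nodes beyond the first 60 is missing bond4 or has the variable unset.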

Are there any restrictions on using the Gloo backend with the bond4 network configuration during training?