When we train models on a multi-node cluster with "export GLOO_SOCKET_IFNAME=bond4" set on the nodes, training fails with "RuntimeError: Gloo connectFullMesh failed ..." once the number of nodes using bond4 exceeds 60 (for example, 64). With 60 or fewer nodes using bond4, training works.
Are there any restrictions on using the Gloo backend with the bond4 network configuration during training?
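
In case it helps with triage, below is a minimal sketch of a check that could be run on each node to confirm that the interface named by GLOO_SOCKET_IFNAME is present locally. The helper name interface_exists is hypothetical (not part of PyTorch or Gloo), it uses only the Python standard library, and it does not test Gloo connectivity itself, so it can only rule out a missing or misnamed interface.

```python
import os
import socket


def interface_exists(ifname, available=None):
    """Return True if `ifname` is one of the local network interface names.

    `available` lets a caller pass a list of names explicitly; by default the
    names are read from the operating system (Unix-like systems).
    """
    if available is None:
        # socket.if_nameindex() returns (index, name) pairs for local interfaces.
        available = [name for _, name in socket.if_nameindex()]
    return ifname in available


if __name__ == "__main__":
    # Fall back to "bond4" (the value from this report) if the variable is unset.
    ifname = os.environ.get("GLOO_SOCKET_IFNAME", "bond4")
    status = "present" if interface_exists(ifname) else "MISSING"
    print(f"{socket.gethostname()}: {ifname} is {status}")
```

Running this on all 64 nodes and comparing the output would show whether any node lacks bond4; if every node reports the interface as present, the failure likely lies elsewhere.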