Open everpeace opened 6 years ago
Hi @everpeace, if I use kubectl exec to initialise the MPI processes, will kubectl exec also be used for worker-to-worker communication, say, synchronous SGD training among multiple GPUs across multiple nodes? If so, it might have a negative impact on large model training in a large cluster whose API server is busy all the time. Is my understanding correct?
I suspected so at first. However, plm_rsh_agent seems to be used only during initialization, so worker-to-worker communication seems to happen directly pod to pod. So plm_rsh_agent probably won't cause a performance impact.
SSH-less MPI on k8s is used in kubeflow/mpi-operator and kubeflow/chainer-operator.
@everpeace Thanks for the reply. After a few days of debugging, I am now pretty sure that kubectl exec as an RSH agent is only used during initialization, and that it's a good replacement for password-less SSH.
@EdwardZhang88 By the way, do you often use kube-openmpi to run your MPI jobs?
Actually, I haven't been active on this project so far, but PRs are always welcome 👍
I confirmed that `-mca plm_rsh_agent` can realize an ssh-less kube-openmpi environment. We can use `kubectl exec` instead of the default ssh. However, `kubectl exec` traffic goes through kube-apiserver during its execution, so we have to consider the performance impact on Open MPI execution.
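
For illustration, the idea above could be sketched as a small wrapper script passed to `plm_rsh_agent`. This is a minimal sketch, not the actual kube-openmpi implementation: the function name `rsh_agent`, the `KUBECTL` override variable, and the script path in the usage comment are hypothetical, and it assumes that MPI host names match pod names and that Open MPI calls the agent with the host as the first argument followed by the remote command.

```shell
#!/bin/sh
# Hypothetical rsh-agent wrapper (assumption, not the kube-openmpi script).
# Assumes Open MPI invokes the agent as: <agent> <host> <command...>
# and that <host> is also the name of the target pod.

rsh_agent() {
    pod="$1"
    shift
    # KUBECTL is a hypothetical override, handy for dry runs (e.g. KUBECTL=echo).
    # The remaining arguments are joined into one command line for the pod's shell.
    ${KUBECTL:-kubectl} exec "$pod" -- /bin/sh -c "$*"
}

# Run only when called with arguments, as mpirun would do.
if [ "$#" -gt 0 ]; then
    rsh_agent "$@"
fi

# Possible usage (the path is an assumption):
#   mpirun -mca plm_rsh_agent /path/to/kubectl-rsh-agent.sh -np 4 ./my_mpi_app
```

Each launch through this wrapper is one `kubectl exec` call, so the kube-apiserver load from it would scale with the number of remote daemons started, not with the amount of MPI traffic exchanged afterwards.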