kube-openmpi provides mainly two things:

- Docker images which are ready to run Open MPI workloads on Kubernetes, and
- a Helm chart which deploys an MPI cluster on your Kubernetes cluster. Please see the chart directory for details.

Pre-built images are published on DockerHub as `everpeace/kube-openmpi`. Available tags:

| image tag | tag scheme | note |
|:--|:--|:--|
| `2.1.2-16.04-0.7.0` / `0.7.0` | `$(OPENMPI_VERSION)-$(UBUNTU_IMAGE_TAG)-$(KUBE_OPENMPI_VERSION)` | `$(UBUNTU_IMAGE_TAG)` refers to tags of `ubuntu` |
| `2.1.2-8.0-cudnn7-devel-ubuntu16.04-0.7.0` / `0.7.0-cuda8.0`<br>`2.1.2-9.0-cudnn7-devel-ubuntu16.04-0.7.0` / `0.7.0-cuda9.0`<br>`2.1.2-9.1-cudnn7-devel-ubuntu16.04-0.7.0` / `0.7.0-cuda9.1` | `$(OPENMPI_VERSION)-$(CUDA_IMAGE_TAG)-$(KUBE_OPENMPI_VERSION)` | `$(CUDA_IMAGE_TAG)` refers to tags of `nvidia/cuda` |
| `0.7.0-cuda8.0-nccl2.1.4-1-chainer4.0.0b4-chainermn1.2.0`<br>`0.7.0-cuda9.0-nccl2.1.15-1-chainer4.0.0b4-chainermn1.2.0`<br>`0.7.0-cuda9.1-nccl2.1.15-1-chainer4.0.0b4-chainermn1.2.0` | `$(KUBE_OPENMPI_VERSION)-$(CUDA_VERSION)-nccl$(NCCL_CUDA80_PACKAGE_VERSION)-chainer$(CHAINER_VERSION)-chainermn$(CHAINER_MN_VERSION)` | Chainer/ChainerMN (with CuPy and NCCL2) based images |
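For example, you can pull the plain ubuntu-based image using a tag from the table above:

$ docker pull everpeace/kube-openmpi:0.7.0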
# generate temporary key
$ ./gen-ssh-key.sh
# edit your values.yaml
$ $EDITOR values.yaml
$ MPI_CLUSTER_NAME=__CHANGE_ME__
$ KUBE_NAMESPACE=__CHANGE_ME__
$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -
# wait until $MPI_CLUSTER_NAME-master is ready
$ kubectl get -n $KUBE_NAMESPACE po $MPI_CLUSTER_NAME-master
# You can run mpiexec now via 'kubectl exec'!
# hostfile is automatically generated and located at '/kube-openmpi/generated/hostfile'
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
--hostfile /kube-openmpi/generated/hostfile \
--display-map -n 4 -npernode 1 \
sh -c 'echo $(hostname):hello'
Data for JOB [43686,1] offset 0
======================== JOB MAP ========================
Data for node: MPI_CLUSTER_NAME-worker-0 Num slots: 2 Max slots: 0 Num procs: 1
Process OMPI jobid: [43686,1] App: 0 Process rank: 0 Bound: UNBOUND
Data for node: MPI_CLUSTER_NAME-worker-1 Num slots: 2 Max slots: 0 Num procs: 1
Process OMPI jobid: [43686,1] App: 0 Process rank: 1 Bound: UNBOUND
Data for node: MPI_CLUSTER_NAME-worker-2 Num slots: 2 Max slots: 0 Num procs: 1
Process OMPI jobid: [43686,1] App: 0 Process rank: 2 Bound: UNBOUND
Data for node: MPI_CLUSTER_NAME-worker-3 Num slots: 2 Max slots: 0 Num procs: 1
Process OMPI jobid: [43686,1] App: 0 Process rank: 3 Bound: UNBOUND
=============================================================
MPI_CLUSTER_NAME-worker-1:hello
MPI_CLUSTER_NAME-worker-2:hello
MPI_CLUSTER_NAME-worker-0:hello
MPI_CLUSTER_NAME-worker-3:hello
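To see which workers `mpiexec` will target, you can inspect the generated hostfile directly (the path is the one mentioned above):

$ kubectl -n $KUBE_NAMESPACE exec $MPI_CLUSTER_NAME-master -- cat /kube-openmpi/generated/hostfile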
MPI workers form a StatefulSet, so you can scale the cluster up or down.
# scale workers from 4 to 3
$ kubectl -n $KUBE_NAMESPACE scale statefulsets $MPI_CLUSTER_NAME-worker --replicas=3
statefulset "MPI_CLUSTER_NAME-worker" scaled
# Then you can run mpiexec again
# the hostfile is updated automatically every 15 seconds by default
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
--hostfile /kube-openmpi/generated/hostfile \
--display-map -n 3 -npernode 1 \
sh -c 'echo $(hostname):hello'
...
MPI_CLUSTER_NAME-worker-0:hello
MPI_CLUSTER_NAME-worker-2:hello
MPI_CLUSTER_NAME-worker-1:hello
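Scaling back up works the same way; the hostfile picks up the new workers after the next automatic update:

# scale workers back from 3 to 4
$ kubectl -n $KUBE_NAMESPACE scale statefulsets $MPI_CLUSTER_NAME-worker --replicas=4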
To tear down the cluster, delete the generated manifests:

$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE delete -f -
To use your own custom Docker image, please edit the `image` section in `values.yaml`:
image:
repository: yourname/kube-openmpi-based-custom-image
tag: latest
This expects that your custom image is based on our base image (`everpeace/kube-openmpi`) and does NOT change any ssh/sshd configurations defined in image/Dockerfile.
Please refer to Custom ChainerMN image example on kube-openmpi for details.
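A minimal sketch of building and pushing such a custom image (the image name is the placeholder used above):

$ docker build -t yourname/kube-openmpi-based-custom-image:latest .
$ docker push yourname/kube-openmpi-based-custom-image:latest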
If your image is in a private registry, please create a Secret of `docker-registry` type in your namespace by referring here. Then you can specify the secret name in your `values.yaml`:
image:
repository: <your_registry>/<your_org>/<your_image_name>
tag: <your_tag>
pullSecrets:
- name: <docker_registry_secret_name>
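Creating the `docker-registry` secret can look like this (a sketch; the server, username, and password values are placeholders):

$ kubectl -n $KUBE_NAMESPACE create secret docker-registry <docker_registry_secret_name> \
    --docker-server=<your_registry> \
    --docker-username=<your_user> \
    --docker-password=<your_password>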
kube-openmpi supports importing your code hosted on GitHub into your containers. To do so, please edit the `appCodesToSync` section in `values.yaml`. You can define multiple GitHub repositories.
appCodesToSync:
- name: your-app-name
gitRepo: https://github.com/org/your-app-name.git
gitBranch: master
fetchWaitSecond: "120"
mountPath: /repo
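Once the cluster is up, the synced code should appear under the configured `mountPath`; a quick, illustrative check:

$ kubectl -n $KUBE_NAMESPACE exec $MPI_CLUSTER_NAME-master -- ls /repo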
When your code is in a private (secret) git repository, the secret repo must be accessible via ssh. Please remember that this feature requires `securityContext.runAs: 0` for the side-car containers which fetch your code into the mpi containers.
You need to register an ssh key with the repo. I recommend setting up Deploy Keys for your secret repo because a deploy key is valid only for the target repository and can be read-only.
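Generating a key pair for a Deploy Key might look like this (a sketch; register the generated `.pub` file as the Deploy Key on your git server):

$ ssh-keygen -t rsa -f <deploy-private-key-file> -N ''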
Create a `generic` type Secret which has a key `ssh` whose value is the private key:
$ kubectl create -n $KUBE_NAMESPACE secret generic <git-sync-cred-name> --from-file=ssh=<deploy-private-key-file>
Then, you can define `appCodesToSync` entries which reference the secret:
- name: <your-secret-repo>
gitRepo: git@<git-server>:<your-org>/<your-secret-repo>.git
gitBranch: master
fetchWaitSecond: "120"
mountPath: <mount-point>
gitSecretName: <git-sync-cred-name>
By default, kube-openmpi runs your mpi cluster as the root user. However, from a security standpoint, you might want to run your mpi-cluster as a non-root user. There are two ways to achieve this.
The first way is to use the `openmpi` user and group: kube-openmpi base docker images on DockerHub ship with a normal user `openmpi` (uid=1000, gid=1000). To run your mpi-cluster as this user, edit your `values.yaml` to specify a SecurityContext like below:
# values.yaml
...
mpiMaster:
securityContext:
runAsUser: 1000
fsGroup: 1000
...
mpiWorkers:
securityContext:
runAsUser: 1000
fsGroup: 1000
Then you can run `mpiexec` as the `openmpi` user. (You will need to tear down and re-deploy your mpi-cluster if you already have one running.)
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec \
--hostfile /kube-openmpi/generated/hostfile \
--display-map -n 4 -npernode 1 \
sh -c 'echo $(hostname):hello'
...
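You can verify which user the cluster runs as; the expected output below is illustrative, based on the uid/gid stated above:

$ kubectl -n $KUBE_NAMESPACE exec $MPI_CLUSTER_NAME-master -- id
uid=1000(openmpi) gid=1000(openmpi) groups=1000(openmpi)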
The second way is to use a custom user. You need to build your own base image because a user with your desired uid/gid must exist (be embedded) in the docker image. To do this, just run `make` with the options below.
$ cd images
$ make REPOSITORY=<your_org>/<your_repo> SSH_USER=<username> SSH_UID=<uid> SSH_GID=<gid>
This creates an ubuntu-based image, a cuda8 (cudnn7) image, and a cuda9 (cudnn7) image.
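A concrete invocation might look like this (the repository name, username, and ids are all placeholders):

$ make REPOSITORY=myorg/my-kube-openmpi SSH_USER=mpiuser SSH_UID=2000 SSH_GID=2000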
Then set the `image` in your `values.yaml` and set your uid/gid in `runAsUser`/`fsGroup` as in the previous section.
As stated in kubeflow/tf-operator#165, spawning multiple kube-openmpi clusters can cause deadlock. To prevent it, you might want gang-scheduling (i.e. scheduling multiple pods all together) in Kubernetes. Currently, kubernetes-incubator/kube-arbitrator supports it by using the kube-batchd scheduler and a PodDisruptionBudget.
Please follow these steps:

1. Edit the `mpiWorkers.customScheduling` section in your `values.yaml` like this:
mpiWorkers:
customScheduling:
enabled: true
schedulerName: <your_kube-batchd_scheduler_name>
podDisruptionBudget:
enabled: true
2. Deploy your kube-openmpi cluster.
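After deployment, you can check that a PodDisruptionBudget was created for the workers:

$ kubectl -n $KUBE_NAMESPACE get poddisruptionbudgets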
We publish Chainer/ChainerMN (with CuPy and NCCL2) based images. Let's use one. In this example, we run the `train_mnist` example from the ChainerMN repo. If you want to build your own docker image, please refer to the Custom ChainerMN image example on kube-openmpi for details.
Edit your `values.yaml` so that:

- the cluster has 2 mpi workers, with 1 GPU resource assigned to each mpi worker, and
- an `appCodesToSync` entry fetches the ChainerMN examples used to run `train_mnist`.

image:
repository: everpeace/kube-openmpi
tag: 0.7.0-cuda8.0-nccl2.1.4-1-chainer4.0.0b4-chainermn1.2.0
...
mpiWorkers:
num: 2
resources:
limits:
nvidia.com/gpu: 1
...
appCodesToSync:
- name: chainermn
gitRepo: https://github.com/chainer/chainermn.git
gitBranch: master
fetchWaitSecond: "120"
mountPath: /chainermn-examples
subPath: chainermn/examples
...
Deploy your kube-openmpi cluster:
$ MPI_CLUSTER_NAME=__CHANGE_ME__
$ KUBE_NAMESPACE=__CHANGE_ME__
$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -
Run `train_mnist` with GPUs:
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
--hostfile /kube-openmpi/generated/hostfile \
--display-map -n 2 -npernode 1 \
python3 /chainermn-examples/mnist/train_mnist.py -g
======================== JOB MAP ========================
Data for node: MPI_CLUSTER_NAME-worker-0 Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [28697,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]
Data for node: MPI_CLUSTER_NAME-worker-1 Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [28697,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]
=============================================================
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
...
1 0.224002 0.102322 0.9335 0.9695 17.1341
2 0.0733692 0.0672879 0.977967 0.9765 24.7188
...
20 0.00531046 0.105093 0.998267 0.9799 160.794
Change log (newest first):

- Fixed `init.sh` so that a non-root user won't fail to run `init.sh`.
- Renamed `start_sshd.sh` to `init.sh`. When `ONE_SHOT` is true, `init.sh` will execute the user command passed as arguments to `init.sh` just after sshd is up.
- `oneShot` mode is supported, along with an auto-scale-down-workers feature. In `mpiMaster.oneShot` mode, `mpiMaster.oneShot.command` is executed automatically on the master once the cluster is up. If `mpiMaster.oneShot.autoScaleDownWorkers` is enabled and `mpiMaster.oneShot.command` completes successfully (i.e. its return code is `0`), the worker cluster is scaled down to `0`.
- gang-scheduling (scheduling a group of pods all together) for mpi workers is now available via `kube-batchd` in `kube-arbitrator`.
- Custom `volumes`/`volumeMounts` are supported.
- Made the `Run` step simpler: changed to use `kubectl exec -it -- mpiexec` directly.
- `root` can ssh to both mpi-master and mpi-workers when containers run as root.
- Containers run as `root` by default. You can run as the `openmpi` user as before by setting `runAsUser`/`fsGroup` in `values.yaml`.
- Added `orte_keep_fqdn_hostnames=t` to `openmpi-mca-params.conf` so that fully-qualified pod hostnames work with the `mpiexec` command.
- Supported the `CustomPodDNS` feature gate!!
- The `bootstrap` job was removed.
- `hostfile-updater` was introduced; now you can scale your mpi cluster up/down dynamically! It runs in the `mpi-master` pod as a side-car container.
- The generated `hostfile` was moved to `/kube-openmpi/generated/hostfile`.
- Supported `securityContext` (e.g. `securityContext.runAs`) (#1).
- Fixed: `mca:mpi:base:param:mpi_built_with_cuda_support:value:true` is now set when cuda based images are built. You can NOT use Open MPI with CUDA on `0.1.0`, so please use `0.2.0`.
- Fixed a bug where `resources` in `values.yaml` was ignored.
- Fixed: `workers` can now resolve `master` in DNS.