BaguaSys / operator

Kubernetes operator for Bagua distributed training jobs.
https://baguasys.github.io/tutorials/kubernetes-integration/index.html
MIT License

Kubernetes operator for Bagua jobs

This repository implements a Kubernetes operator for Bagua distributed training jobs. It supports both static and elastic workloads. See the CRD definition.

Prerequisites

Installation

Run the operator locally


```shell
git clone https://github.com/BaguaSys/operator.git
cd operator

# Install the CRD
kubectl apply -f config/crd/bases/bagua.kuaishou.com_baguas.yaml

# Start the operator
go run ./main.go
```
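To confirm that the CRD was registered, you can query it by name from a second terminal, since `go run` keeps the first one busy. The sketch below derives that name from the manifest filename. It assumes the file follows controller-gen's `<group>_<plural>.yaml` naming, which would make the CRD name `baguas.bagua.kuaishou.com`. The helper `crd_name_from_file` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: derive a CRD name from a controller-gen style
# manifest filename (<group>_<plural>.yaml -> <plural>.<group>).
crd_name_from_file() {
  base=$(basename "$1" .yaml)   # e.g. bagua.kuaishou.com_baguas
  group=${base%_*}              # text before the last underscore
  plural=${base##*_}            # text after the last underscore
  printf '%s.%s\n' "$plural" "$group"
}
```

With the helper defined, `kubectl get crd "$(crd_name_from_file config/crd/bases/bagua.kuaishou.com_baguas.yaml)"` should list the CRD if the naming assumption holds.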

Deploy the operator

Install the Bagua operator on an existing Kubernetes cluster.

```shell
kubectl apply -f https://raw.githubusercontent.com/BaguaSys/operator/master/deploy/deployment.yaml
```

Enjoy! The operator creates its resources in the bagua namespace.
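A quick way to check the deployment is to list what now exists in the `bagua` namespace. The snippet below is a minimal sketch: the function name `show_bagua_resources` and the `KUBECTL` override variable are illustrative assumptions, and only the namespace comes from this README.

```shell
# Assumption: KUBECTL may be overridden, e.g. to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# List the deployments and pods in the "bagua" namespace.
show_bagua_resources() {
  "$KUBECTL" get deployments,pods --namespace bagua
}
```

Calling `show_bagua_resources` after the `kubectl apply` step should show the operator's deployment and its pod once they have been scheduled.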

Examples

Sample job definitions are in config/samples. Run them as follows:

```shell
kubectl apply -f config/samples/bagua_v1alpha1_bagua_static.yaml
```

Verify pods are running

```shell
kubectl get pods

NAME                           READY   STATUS    RESTARTS   AGE
bagua-sample-static-master-0   1/1     Running   0          45s
bagua-sample-static-worker-0   1/1     Running   0          45s
bagua-sample-static-worker-1   1/1     Running   0          45s
```

```shell
kubectl apply -f config/samples/bagua_v1alpha1_bagua_elastic.yaml
```

Verify pods are running

```shell
kubectl get pods

NAME                            READY   STATUS    RESTARTS   AGE
bagua-sample-elastic-etcd-0     1/1     Running   0          63s
bagua-sample-elastic-worker-0   1/1     Running   0          63s
bagua-sample-elastic-worker-1   1/1     Running   0          63s
bagua-sample-elastic-worker-2   1/1     Running   0          63s
```
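If you want to script the "pods are running" check rather than read the table by eye, one option is to parse the `kubectl get pods` output. The helper below is a hypothetical sketch, not something the operator ships: `pods_ready` reads the table on stdin and succeeds only when at least one pod is listed and every pod has STATUS `Running` with all containers ready.

```shell
# Hypothetical helper: succeed only if every pod in `kubectl get pods`
# output (read from stdin) is Running and fully ready.
pods_ready() {
  awk '
    $1 == "NAME" { next }              # skip the header row
    NF == 0      { next }              # skip blank lines
    {
      split($2, r, "/")                # READY column, e.g. "1/1"
      if ($3 != "Running" || r[1] != r[2]) bad++
      seen++
    }
    END {
      if (seen > 0 && bad == 0) exit 0 # at least one pod, none unhealthy
      exit 1
    }
  '
}
```

A typical use would be `kubectl get pods | pods_ready && echo "all pods ready"`. An empty listing counts as not ready, so the check does not pass before the job's pods have been created.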