This repository implements a Kubernetes operator for Bagua distributed training jobs, supporting both static and elastic workloads. See the CRD definition in `config/crd/bases/bagua.kuaishou.com_baguas.yaml`.
git clone https://github.com/BaguaSys/operator.git
cd operator
# install the CRD
kubectl apply -f config/crd/bases/bagua.kuaishou.com_baguas.yaml
go run ./main.go
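To confirm that the `kubectl apply` step above registered the CRD, a check along these lines can be run in a second terminal while the operator is running. This is a minimal sketch: `crd_installed` is a hypothetical helper, and the CRD name `baguas.bagua.kuaishou.com` is an assumption inferred from the manifest file name, not something this README states.

```shell
# crd_installed NAME: succeed if the cluster has a CRD called NAME.
# Hypothetical helper; it only wraps the standard `kubectl get crd` lookup.
crd_installed() {
  kubectl get crd "$1" >/dev/null 2>&1
}

# Assumed CRD name, inferred from bagua.kuaishou.com_baguas.yaml.
# Usage (needs a reachable cluster):
#   crd_installed baguas.bagua.kuaishou.com && echo "CRD is installed"
```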
Install Bagua on an existing Kubernetes cluster.
kubectl apply -f https://raw.githubusercontent.com/BaguaSys/operator/master/deploy/deployment.yaml
Enjoy! Bagua will create resources in the `bagua` namespace.
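One way to check that the deployment came up is to look at the pods in the `bagua` namespace. The helper below is a sketch under assumptions: `pods_all_running` is a hypothetical name, and it relies on the default `kubectl get pods` column layout, where the third column is STATUS.

```shell
# pods_all_running NAMESPACE: succeed only if NAMESPACE has at least one pod
# and every pod in it reports STATUS "Running". Hypothetical helper.
pods_all_running() {
  kubectl get pods -n "$1" --no-headers 2>/dev/null | awk '
    BEGIN { seen = 0; bad = 0 }
    NF    { seen++; if ($3 != "Running") bad++ }
    END   { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage (needs a reachable cluster):
#   pods_all_running bagua && echo "operator pods are running"
```

Pods in any other state, including `Completed`, make the check fail, so it only suits namespaces where every pod is expected to stay running.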
Sample manifests are available in `config/samples`. Run them as follows:
kubectl apply -f config/samples/bagua_v1alpha1_bagua_static.yaml
Verify pods are running
```shell
kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
bagua-sample-static-master-0   1/1     Running   0          45s
bagua-sample-static-worker-0   1/1     Running   0          45s
bagua-sample-static-worker-1   1/1     Running   0          45s
```
kubectl apply -f config/samples/bagua_v1alpha1_bagua_elastic.yaml
Verify pods are running
```shell
kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
bagua-sample-elastic-etcd-0     1/1     Running   0          63s
bagua-sample-elastic-worker-0   1/1     Running   0          63s
bagua-sample-elastic-worker-1   1/1     Running   0          63s
bagua-sample-elastic-worker-2   1/1     Running   0          63s
```