This repo shows how to train a model on CPU instances in a Kubernetes cluster using the Kubeflow training operator (TFJob), Kubeflow Katib, and a Deep Learning Container.
TFJob is the Kubeflow implementation of a Kubernetes custom resource used to run (distributed) TensorFlow training jobs on Kubernetes.
Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports hyperparameter tuning, early stopping, and neural architecture search.
This tutorial guides you through tuning and training a classification model on the MNIST dataset with Keras on a single-node CPU instance that runs the containerized training code (Keras framework).
This work forms the first step in building out training and inference workflows with Argo.
Create a kind cluster with Kubernetes v1.25.2:
kind create cluster --config kind-config.yaml
echo -e "\nKind cluster has been created\n"
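The cluster layout comes from kind-config.yaml in this repo. Purely as an illustration (the node roles and node image tag below are assumptions, the actual file may differ), such a config can look like:

# Hypothetical kind-config.yaml sketch; the file in this repo may differ.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.25.2   # node image matching Kubernetes v1.25.2
  - role: worker
    image: kindest/node:v1.25.2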
Set the kubectl context:
kubectl config use-context kind-kind
Set up MinIO and create a bucket for Argo Workflows:
kustomize build | kubectl apply -f -
kubectl port-forward svc/minio 9999:9001
Open the MinIO console at http://localhost:9999
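The MinIO manifests are pulled in by the kustomization in this repo. As an illustration only (the resource file names below are assumptions), a kustomization.yaml for such a setup usually just aggregates the MinIO Deployment and Service:

# Hypothetical kustomization.yaml sketch; the kustomization in this repo may differ.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - minio-deployment.yaml   # assumed file name for the MinIO Deployment
  - minio-service.yaml      # assumed file name for the Service exposing the API and console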
Training Operators
Deploy the TFJob training operator in standalone mode:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
Hyperparameter Tuning
Deploy the Katib standalone components:
echo -e "\nDeploying Katib components\n"
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
Containerize the MNIST classifier code. Build and push the Docker image to ECR by running ./build-and-push.sh in the mnist folder.
To start training, deploy the TFJob configuration file using kubectl.
kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
kubectl apply -f tf-job.yaml
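pv.yaml and pvc.yaml provision the storage used by the training job. The manifests in the repo are authoritative; as a hedged sketch (the volume name, claim name, size, and hostPath are assumptions), they might resemble:

# Hypothetical pv.yaml / pvc.yaml sketch; the manifests in this repo may differ.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-pv              # assumed name
spec:
  capacity:
    storage: 5Gi                 # assumed size
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/training         # assumed path on the kind node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-pvc             # assumed name, referenced by the TFJob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi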
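tf-job.yaml defines the TFJob itself. The real manifest lives in this repo; a minimal single-worker sketch consistent with the commands below (the ECR image URI, resource limits, and volume details are assumptions) looks roughly like:

# Hypothetical tf-job.yaml sketch; the manifest in this repo may differ.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-training
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow                       # training-operator expects this container name
              image: <account>.dkr.ecr.<region>.amazonaws.com/mnist:latest   # assumed ECR image
              resources:
                limits:
                  cpu: "2"
                  memory: 4Gi
              volumeMounts:
                - name: training-data
                  mountPath: /data                   # assumed mount path
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-pvc              # assumed claim name from pvc.yaml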
Watch the training process
kubectl describe tfjob tensorflow-training
kubectl logs --follow tensorflow-training-worker-0
kubectl logs po/tensorflow-training-worker-0
To remove the TFJob and associated pods
kubectl delete tfjob tensorflow-training
Get a shell to the container
kubectl exec --stdin --tty tensorflow-training-worker-0 -- /bin/bash
Run Katib Experiment
Create a hyperparameter tuning Katib Experiment with the random search algorithm using kubectl:
kubectl create -f example2.yaml
The Experiment runs twelve training jobs (Trials) and tunes the following hyperparameters (a sketch of the Experiment spec follows the list):
Learning Rate (lr).
Batch size (batch-size).
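example2.yaml in this repo is the source of truth. As an illustrative sketch only (the objective metric, parameter ranges, parallelism, entrypoint, and image are assumptions), a random-search Experiment of this shape can look like:

# Hypothetical example2.yaml sketch; the Experiment in this repo may differ.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tfjob-mnist
  namespace: kubeflow
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy            # assumed metric logged by the training code
  algorithm:
    algorithmName: random
  maxTrialCount: 12                          # twelve Trials, as described above
  parallelTrialCount: 3                      # assumed parallelism
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"                          # assumed range
        max: "0.05"
    - name: batch-size
      parameterType: int
      feasibleSpace:
        min: "1"                             # assumed range
        max: "64"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate
        reference: lr
      - name: batchSize
        description: Batch size
        reference: batch-size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: <account>.dkr.ecr.<region>.amazonaws.com/mnist:latest   # assumed image
                    command:
                      - "python"
                      - "/opt/mnist.py"                          # assumed entrypoint
                      - "--lr=${trialParameters.learningRate}"
                      - "--batch-size=${trialParameters.batchSize}"

Katib's default StdOut metrics collector reads the objective metric from the training logs, which is why each Trial pod carries the metrics-logger-and-collector container used in the log command below.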
After creating the above example, check the Experiment status:
$ kubectl get experiment tfjob-mnist -n kubeflow
Check the Suggestion status:
$ kubectl get suggestion -n kubeflow
Check the Trial statuses:
$ kubectl get trial -n kubeflow
To see the logs of each trial:
kubectl logs --follow random-example-2rmwkwx6-hkdwb -n kubeflow -c metrics-logger-and-collector
You can get the best hyperparameters with the following command:
$ kubectl get experiment tfjob-mnist -n kubeflow -o jsonpath='{range .status.currentOptimalTrial.parameterAssignments[*]}{.name}: {.value}{"\n"}{end}'
learning_rate: 0.03786431590246644
batch_size: 1
Port-forward to the Katib UI:
kubectl port-forward svc/katib-ui 8080:80 -n kubeflow
xdg-open http://localhost:8080/katib
Argo Workflows
Create the argo namespace and install the Argo Workflows controller:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.1.3/install.yaml
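Once the controller is running, a throwaway Workflow can be submitted to confirm the installation. The manifest below is a hello-world sketch for that purpose and is not part of this repo:

# Hypothetical smoke-test Workflow; not part of this repo.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: busybox
        command: ["echo", "argo workflows is up"]

Because it uses generateName, submit it with kubectl create -f rather than kubectl apply -f.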
Export the kubeconfig from the Terraform output and publish it as a CI artifact:
terraform output -json | jq -r .kubeconfig.value >> .kubeconfig
artifacts:
  paths:
    - .kubeconfig