GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0
211 stars 154 forks source link

Add a Helm chart for the KubeRay TPU Webhook #745

Closed ryanaoleary closed 1 month ago

ryanaoleary commented 1 month ago

This PR adds a helm chart for the TPU webhook, allowing users to install the webhook in their GKE cluster with a simpler set of commands (rather than cloning ai-on-gke and running the make commands). The values file in the helm-chart enables users to more easily edit the namespace and image used by the webhook deployment.

# tested locally with the following command from the kuberay-tpu-webhook folder
helm install tpu-webhook helm-chart/

# once merged, the webhook would be installed with the following command
helm install tpu-webhook https://github.com/GoogleCloudPlatform/ai-on-gke.git/ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart

This PR has been manually tested by running the above commands, checking that the webhook deploys successfully, and verifying that the TPU env vars are successfully injected to Ray TPU pods.

TODO: once this PR is merged I'll update the webhook install commands in https://github.com/ray-project/ray/pull/45924.

andrewsykim commented 1 month ago

Is the extra step to install cert-manager still required, or can we also bundle that into the helm chart?

ryanaoleary commented 1 month ago

Is the extra step to install cert-manager still required, or can we also bundle that into the helm chart?

Currently it's still required to install cert-manager separately, since it has its own helm chart and we use a modified helm install command to allow cert-manager to work with Autopilot GKE clusters.

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install --create-namespace --namespace cert-manager --set installCRDs=true --set global.leaderElection.namespace=cert-manager cert-manager jetstack/cert-manager

I couldn't find a good way to add the cert-manager helm chart as a dependency since I'm not sure if you can pass the above arguments to the Chart.yaml.

Update: Nevermind I was misdiagnosing the issue, the arguments are passed successfully for cert-manager as a dependency, but the install fails with:

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "kuberay-tpu-webhook-certs" namespace: "ray-system" from "": no matches for kind "Certificate" in version "cert-manager.io/v1" ensure CRDs are installed first

since cert-manager hasn't finished installing before we attempt to create the issuer/cert.

andrewsykim commented 1 month ago

Thanks, keeping cert-manager installation as a separate step seems fine, was just curious if it was an option.