GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

Ray TPU Webhook Autoscaling Support and Reliability Improvements #723

Open ryanaoleary opened 1 week ago

ryanaoleary commented 1 week ago

This PR improves the reliability of the webhook by making it stateless between calls, which fixes issues caused by the sliceToWorkers mapping being cleared when the webhook restarts. To do this, it adds a k8s client to the webhook that lists the current Pods in the same namespace as the intercepted Pod. Because the state is rebuilt from that list on each call, the webhook no longer needs to intercept Pod deletion requests. Additionally, this PR generates TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster, which supports autoscaling RayClusters.
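As a rough illustration of the stateless approach (not the code in this PR), the sketch below rebuilds a slice-to-workers mapping from a freshly listed set of Pods on every call and derives a hostname list per Pod. The `Pod` struct, the `Slice` field, and the `<slice>-<index>.<subdomain>` hostname pattern are all assumptions made for the example; the real webhook would use the Kubernetes API types and whatever naming scheme it already follows.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pod is a hypothetical stand-in for the few Pod fields this sketch needs.
type Pod struct {
	Name      string
	Namespace string
	Slice     string // assumed identifier of the TPU slice the Pod belongs to
}

// sliceToWorkers rebuilds the slice -> worker Pod names mapping from the
// Pods passed in. Nothing is cached between calls, so a webhook restart
// cannot lose state, and deleted Pods simply drop out of the next listing.
func sliceToWorkers(pods []Pod, namespace string) map[string][]string {
	m := make(map[string][]string)
	for _, p := range pods {
		if p.Namespace != namespace || p.Slice == "" {
			continue
		}
		m[p.Slice] = append(m[p.Slice], p.Name)
	}
	for s := range m {
		sort.Strings(m[s]) // deterministic order for callers
	}
	return m
}

// workerHostnames builds a comma-separated hostname list for one slice.
// The naming pattern here is an assumption for illustration only.
func workerHostnames(slice, subdomain string, numHosts int) string {
	hosts := make([]string, 0, numHosts)
	for i := 0; i < numHosts; i++ {
		hosts = append(hosts, fmt.Sprintf("%s-%d.%s", slice, i, subdomain))
	}
	return strings.Join(hosts, ",")
}

func main() {
	// In the webhook this list would come from the k8s client on each request.
	pods := []Pod{
		{Name: "worker-b", Namespace: "default", Slice: "workergroup-0"},
		{Name: "worker-a", Namespace: "default", Slice: "workergroup-0"},
		{Name: "worker-c", Namespace: "other", Slice: "workergroup-0"},
	}
	fmt.Println(sliceToWorkers(pods, "default"))
	fmt.Println(workerHostnames("workergroup-0", "headless-svc", 2))
}
```

Since the hostname list is computed per intercepted Pod from the slice's size rather than once per RayCluster, Pods added later by the autoscaler would receive a value through the same path.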

This PR has been tested as follows: