Ray TPU Webhook Autoscaling Support

GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine

Apache License 2.0

Ray TPU Webhook Autoscaling Support #740

Closed ryanaoleary closed 1 month ago

ryanaoleary commented 1 month ago

This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.

This PR has been tested as follows:

- [x] Unit Tests
- [x] Manual Tests using single-host, multi-host, and an autoscaling RayCluster with a TPU worker group added
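
For readers unfamiliar with per-Pod mutation, here is a minimal sketch of what injecting an environment variable at Pod admission time might look like. It is not the webhook's actual code: the helper name `env_patch` and the plain-dict Pod shape are assumptions made for illustration. The one thing it relies on is that a mutating admission webhook can return a JSON Patch (RFC 6902) against the Pod it intercepts.

```python
def env_patch(pod, name, value):
    """Build JSON Patch ops that add one env var to every container in a Pod.

    `pod` is a plain dict shaped like a Pod manifest (hypothetical input;
    a real webhook would decode it from the AdmissionReview request).
    """
    ops = []
    entry = {"name": name, "value": value}
    for i, container in enumerate(pod["spec"]["containers"]):
        if "env" in container:
            # The "/-" suffix appends to an existing array (RFC 6902).
            ops.append({"op": "add",
                        "path": f"/spec/containers/{i}/env/-",
                        "value": entry})
        else:
            # No env list yet, so create it with the single entry.
            ops.append({"op": "add",
                        "path": f"/spec/containers/{i}/env",
                        "value": [entry]})
    return ops
```

Because the patch is computed from the Pod being admitted, Pods created later by the autoscaler would receive the variable without the RayCluster CR being intercepted again.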

andrewsykim commented 1 month ago

> This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.

Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added or does it only apply for new pods added by the autoscaler?

ryanaoleary commented 1 month ago

> > This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.
>
> Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added or does it only apply for new pods added by the autoscaler?

We don't need to update existing Pods. We assume that TPU PodSlices are scaled atomically, so the TPU_WORKER_HOSTNAMES value we add to each Pod includes the DNS hostnames of all the Pods in its slice, even those that haven't been created yet.
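
As a rough illustration of why existing Pods need no update, the sketch below derives the full hostname list for a slice from values known when any one Pod is admitted. The naming scheme (`<group>-<replica>-<host>.<service>`) and the function name `slice_hostnames` are hypothetical, not taken from the webhook. The point is only that the list is a pure function of the slice's identity and size, so every Pod in the slice gets the same value whether or not its peers exist yet.

```python
def slice_hostnames(group, replica_index, hosts_per_slice, service):
    """Return a comma-separated hostname list for every host in one slice.

    All arguments and the hostname format are illustrative assumptions:
    `group` is the worker group name, `replica_index` identifies the slice,
    `hosts_per_slice` is the number of TPU hosts in the slice, and
    `service` is a headless Service name used for DNS.
    """
    hosts = [
        f"{group}-{replica_index}-{host}.{service}"
        for host in range(hosts_per_slice)
    ]
    return ",".join(hosts)


# Two Pods of the same slice compute identical values independently.
first = slice_hostnames("tpu-group", 0, 2, "headless-svc")
second = slice_hostnames("tpu-group", 0, 2, "headless-svc")
```

A slice added later by the autoscaler would simply use a different `replica_index`, leaving the values already set on existing slices untouched.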

andrewsykim commented 1 month ago

I think this is good to merge, but let's ensure https://github.com/GoogleCloudPlatform/ai-on-gke/pull/723 is also merged before cutting a new tag with the autoscaling changes.