Closed ryanaoleary closed 1 month ago
This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.
Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added, or does it only apply to new pods added by the autoscaler?
We don't need to update existing Pods. We assume that TPU PodSlices are scaled atomically, so the TPU_WORKER_HOSTNAMES value we add to each Pod includes the DNS hostnames for all the Pods in the slice, even if they haven't been created yet.
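As a rough illustration of why existing Pods can stay untouched, the sketch below builds the full hostname list for one slice from values known up front, without looking at which Pods already exist. The function name `sliceHostnames`, its parameters, and the `<group>-<replica>-<host>.<service>` naming scheme are assumptions made for this example and are not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// sliceHostnames returns a comma-separated list of DNS hostnames for every
// host in one TPU slice (one worker group replica).
// Assumption: the "<group>-<replica>-<host>.<service>" naming scheme is
// hypothetical and only illustrates that the list is deterministic.
func sliceHostnames(groupName string, replicaIndex, numHosts int, serviceName string) string {
	hostnames := make([]string, 0, numHosts)
	for host := 0; host < numHosts; host++ {
		hostnames = append(hostnames,
			fmt.Sprintf("%s-%d-%d.%s", groupName, replicaIndex, host, serviceName))
	}
	return strings.Join(hostnames, ",")
}

func main() {
	// Every Pod in replica 1 of a 4-host slice would receive the same value,
	// regardless of how many of its sibling Pods have been created so far.
	fmt.Println(sliceHostnames("tpu-group", 1, 4, "example-headless-svc"))
}
```

Because the value depends only on the slice's own replica index and host count, adding another replica does not change the list computed for Pods in earlier replicas.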
I think this is good to merge, but let's ensure https://github.com/GoogleCloudPlatform/ai-on-gke/pull/723 is also merged before cutting a new tag with the autoscaling changes.
This PR has been tested as follows: