GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

TPU Provisioner reliability improvements #614

Closed danielvegamyhre closed 4 months ago

danielvegamyhre commented 4 months ago

This PR includes the following changes:

Next steps for follow up PRs:

danielvegamyhre commented 4 months ago

/gcbrun

danielvegamyhre commented 4 months ago

LGTM. We could also add a check that the JobSet is still active, not just that it exists, or wait and add that in another PR.

Makes sense. The JobSet utils for checking status are already part of the follow-up PR for the deletion controller update, so I'll include that change in that PR.
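For illustration only, here is a rough sketch of what an "exists and is still active" check could look like. It is not the code from this PR or the follow-up. It assumes the JobSet has been fetched as a plain dict (for example through a Kubernetes client) and that a finished JobSet carries a `Completed` or `Failed` condition with status `"True"`. The helper name `jobset_is_active` and the constant `TERMINAL_CONDITION_TYPES` are hypothetical.

```python
from typing import Any, Mapping, Optional

# Assumption: these condition types mark a JobSet as finished.
# Adjust to match the JobSet API version actually in use.
TERMINAL_CONDITION_TYPES = {"Completed", "Failed"}


def jobset_is_active(jobset: Optional[Mapping[str, Any]]) -> bool:
    """Return True if the JobSet exists and has no terminal condition set."""
    if jobset is None:
        # The lookup found nothing, so the JobSet does not exist.
        return False
    # Tolerate a missing or null status/conditions field on a new JobSet.
    conditions = (jobset.get("status") or {}).get("conditions") or []
    for cond in conditions:
        is_terminal = cond.get("type") in TERMINAL_CONDITION_TYPES
        if is_terminal and cond.get("status") == "True":
            return False
    return True
```

A caller would pass `None` when the lookup returns "not found", so existence and activity are covered by one check.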

danielvegamyhre commented 4 months ago

/gcbrun

danielvegamyhre commented 4 months ago

/retest

danielvegamyhre commented 4 months ago

/gcbrun