AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Fix autoprovisioning with spot nodes #187

Open avrittrohwer opened 1 month ago

avrittrohwer commented 1 month ago

Fixes / Features

Testing / Documentation

Node auto-provisioning with spot

  1. Created an xpk cluster with the --spot and autoprovisioning flags.
  2. Created a workload with a different topology than the cluster default.
  3. Observed a nodepool being created with the new workload topology using spot TPU nodes.

Node auto-provisioning without spot

  1. Created an xpk cluster with the --spot and autoprovisioning flags.
  2. Created a workload with the --on-demand flag and a different topology than the cluster default.
  3. Validated that the generated YAML does not specify the spot node-selector or tolerations.
  4. Observed a nodepool being created with the new workload topology using on-demand TPU nodes.
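
The YAML validation in step 3 could be automated along these lines. This is a minimal sketch, not code from this PR: the helper name `spot_references` and the sample pod specs are hypothetical, and it assumes the spot node-selector and toleration both use GKE's `cloud.google.com/gke-spot` key. It takes an already-parsed pod spec as a plain dict.

```python
# GKE's Spot VM node label / taint key (assumed to be what the workload YAML uses).
SPOT_KEY = "cloud.google.com/gke-spot"


def spot_references(pod_spec: dict) -> list[str]:
    """Return the pod spec fields that still reference spot nodes."""
    found = []
    # A missing or empty nodeSelector is treated as "no selector".
    if SPOT_KEY in (pod_spec.get("nodeSelector") or {}):
        found.append("nodeSelector")
    # One matching toleration is enough to flag the field.
    for toleration in pod_spec.get("tolerations") or []:
        if toleration.get("key") == SPOT_KEY:
            found.append("tolerations")
            break
    return found


# Hypothetical pod specs, standing in for the parsed workload YAML.
on_demand_spec = {
    "nodeSelector": {"cloud.google.com/gke-tpu-topology": "2x2x2"},
}
spot_spec = {
    "nodeSelector": {SPOT_KEY: "true"},
    "tolerations": [
        {"key": SPOT_KEY, "operator": "Equal", "value": "true", "effect": "NoSchedule"}
    ],
}

print(spot_references(on_demand_spec))
print(spot_references(spot_spec))
```

An empty result for the --on-demand workload corresponds to the manual check in step 3.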

Not auto-provisioning with spot

  1. Created an xpk cluster with the --spot flag.
  2. Validated that the nodepool was created with spot nodes.
  3. Created a workload and validated it ran.

avrittrohwer commented 1 month ago

zone: 'us-central2'>] finished with error: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: us-central2-b