xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
This PR adds single-slice CPU support to XPK; it works with n2-standard-32 machines.
Major changes:
Allowed a CPU device-type in the format <"n2-standard-32">-<# of n2-standard-32 VMs>.
To avoid confusion with the cluster's default pool CPU type, deprecated the argument "--cluster-cpu-machine-type" in favor of "--default-pool-cpu-machine-type" and added an error check.
Added nodeAffinity to the workload create spec, which is useful for CPUs because it ensures that workload pods are not scheduled onto default pool machines.
Introduced a CPU cluster env to inject JAX variables into GKE.
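As a rough illustration of the device-type format and the nodeAffinity change described above, here is a minimal sketch. The helper names (`parse_cpu_device_type`, `node_affinity_for_pools`) and the nodepool name in the usage are hypothetical and are not taken from the xpk code; the sketch also assumes that pods are pinned with the standard GKE node label `cloud.google.com/gke-nodepool`, which may differ from what this PR actually matches on.

```python
from __future__ import annotations

import re

# Hypothetical pattern: "n2-standard-32-<number of VMs>", e.g. "n2-standard-32-4".
_CPU_DEVICE_TYPE = re.compile(r"(?P<machine>n2-standard-32)-(?P<count>[1-9]\d*)")


def parse_cpu_device_type(device_type: str) -> tuple[str, int]:
    """Split a CPU device-type string into (machine type, number of VMs)."""
    match = _CPU_DEVICE_TYPE.fullmatch(device_type)
    if match is None:
        raise ValueError(
            f"Unsupported CPU device-type {device_type!r}; "
            "expected n2-standard-32-<number of VMs>"
        )
    return match.group("machine"), int(match.group("count"))


def node_affinity_for_pools(nodepool_names: list[str]) -> dict:
    """Build a Kubernetes affinity stanza restricting pods to the given nodepools.

    Assumption: nodes carry the GKE label cloud.google.com/gke-nodepool.
    """
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "cloud.google.com/gke-nodepool",
                                "operator": "In",
                                "values": list(nodepool_names),
                            }
                        ]
                    }
                ]
            }
        }
    }


machine, num_vms = parse_cpu_device_type("n2-standard-32-4")
affinity = node_affinity_for_pools(["example-cpu-pool"])  # hypothetical pool name
```

A required (rather than preferred) affinity term is used in this sketch so that a pod stays pending instead of falling back to a default pool machine.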
Testing / Documentation
Tested cluster creation and CPU nodepool creation, with args and env properly set.
Ran MaxText workloads on CPUs using these changes.
Documented the changes in the README as well.
Verified that TPU training (with MaxText workloads) runs as expected.
[ y ] Tests pass
[ y ] Appropriate changes to documentation are included in the PR