xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
If both the GKE accelerator type check and the cluster device type check fail, workload creation succeeds anyway. Such workloads should not be created, since the workload and cluster configs can be incompatible.
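The intended rule can be sketched as follows. This is a minimal illustration, not the xpk implementation: the function name `both_checks_failed`, its parameters, and the shape of the cluster mapping are hypothetical, chosen only to mirror the two checks named above.

```python
def both_checks_failed(device_type, gke_accelerator, cluster_device_types):
    """Return True when neither check finds a match in the cluster.

    Hypothetical helper: `cluster_device_types` is assumed to be a dict
    keyed by the device types present in the cluster.
    """
    available = cluster_device_types.keys()
    gke_accelerator_ok = gke_accelerator in available
    device_type_ok = device_type in available
    # Refuse to create the workload only when both checks fail.
    return not gke_accelerator_ok and not device_type_ok


# Cluster contents assumed for illustration; the value is a placeholder.
cluster = {"h100-mega-80gb-8": 1}
print(both_checks_failed("h100-80gb-8", "nvidia-h100-80gb", cluster))       # True
print(both_checks_failed("h100-mega-80gb-8", "nvidia-h100-80gb", cluster))  # False
```

In this sketch a single passing check is enough to let creation proceed, which matches the "both checks failed" condition described above.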
Testing / Documentation
Attempted to create a workload with a device type that is invalid for the cluster:
(xpk) cloudtop [~/xpk] % python xpk.py workload create --device-type h100-80gb-8 --project XXXXXX --zone XXXXXX --cluster XXXXXX --docker-image XXXXXX \
--command "echo hello" \
--num-nodes 2 --workload jonbolin-test-$RANDOM
[XPK] Starting xpk
...
[XPK] Gke Accelerator Type Check: jonbolin-test-998 is requesting nvidia-h100-80gb but cluster only contains dict_keys(['h100-mega-80gb-8']).
[XPK] Device Type Check: jonbolin-test-998 is requesting h100-80gb-8 but cluster only contains dict_keys(['h100-mega-80gb-8']).
[XPK] Both Device Type and GKE Accelerator Type checks failed. XPK will not create the workload jonbolin-test-998.
[XPK] XPK failed, error code 1
[ y ] Tests pass
[ y ] Appropriate changes to documentation are included in the PR