AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Fix issue with device check failure #167

Closed: jonb377 closed this 3 months ago

jonb377 commented 3 months ago

Fixes / Features

Testing / Documentation

Attempted to create a workload with a device type that is invalid for the cluster:

(xpk) cloudtop [~/xpk] % python xpk.py workload create --device-type h100-80gb-8 --project XXXXXX --zone XXXXXX --cluster XXXXXX --docker-image XXXXXX \
                --command "echo hello" \
                --num-nodes 2 --workload jonbolin-test-$RANDOM
[XPK] Starting xpk
...
[XPK] Gke Accelerator Type Check: jonbolin-test-998 is requesting nvidia-h100-80gb but cluster only contains dict_keys(['h100-mega-80gb-8']).
[XPK] Device Type Check: jonbolin-test-aot-998 is requesting h100-80gb-8 but cluster only contains dict_keys(['h100-mega-80gb-8']).
[XPK] Both Device Type and GKE Accelerator Type checks failed. XPK will not create the workload jonbolin-test-aot-998.
[XPK] XPK failed, error code 1
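
The log above suggests two membership checks against the cluster's available device types, with the workload rejected only when both fail. A minimal sketch of that logic follows; it is not xpk's actual implementation. The function name `check_device_availability`, its parameters, and the shape of `cluster_resources` are hypothetical, and the "reject only if both checks fail" rule is inferred from the log wording rather than confirmed by the source.

```python
from typing import Mapping


def check_device_availability(
    workload: str,
    device_type: str,
    gke_accelerator: str,
    cluster_resources: Mapping[str, int],
) -> tuple[bool, list[str]]:
    """Hypothetical helper: decide whether a workload may be created.

    cluster_resources is assumed to map device-type names to node counts.
    Returns (allowed, messages), where messages mimic the log lines above.
    """
    messages: list[str] = []
    available = cluster_resources.keys()

    # Check 1: the GKE accelerator name requested by the workload.
    accelerator_ok = gke_accelerator in cluster_resources
    if not accelerator_ok:
        messages.append(
            f"Gke Accelerator Type Check: {workload} is requesting "
            f"{gke_accelerator} but cluster only contains {available}."
        )

    # Check 2: the xpk device type passed via --device-type.
    device_ok = device_type in cluster_resources
    if not device_ok:
        messages.append(
            f"Device Type Check: {workload} is requesting "
            f"{device_type} but cluster only contains {available}."
        )

    # Assumption: the workload is refused only when both checks fail.
    allowed = accelerator_ok or device_ok
    if not allowed:
        messages.append(
            "Both Device Type and GKE Accelerator Type checks failed. "
            f"XPK will not create the workload {workload}."
        )
    return allowed, messages


# Example mirroring the scenario in the log (cluster contents are assumed).
allowed, messages = check_device_availability(
    workload="jonbolin-test-998",
    device_type="h100-80gb-8",
    gke_accelerator="nvidia-h100-80gb",
    cluster_resources={"h100-mega-80gb-8": 2},
)
for line in messages:
    print(f"[XPK] {line}")
```

Returning the messages instead of printing them inside the helper keeps the check easy to unit-test; the caller decides how to log them and which exit code to use.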

jonb377 commented 3 months ago

@Obliviour The tests are failing, but the failures seem unrelated to this change. Should I retry, or can we override?