xpk (Accelerated Processing Kit, pronounced x-p-k,) is a software tool to help Cloud developers to orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
python3 $XPK inspector ... prints out debug information related to kueue, jobset, workload, gke cluster , gke node pools, and cloud console. It exports all the info needed to debug workload failures to a file to debug with.
Prints out Cloud console links to quota page, iam permission page, gke cluster, workload, all workloads
Set the --workload flag when interested in why a specific workload is not working
Run without the workload flag to investigate general health of the gke cluster and workload management such as node pool health
set the --print-to-terminal arg to print details to console.
Users can investigate the lawyer file and also share the lawyer file with experts to quickly allow investigations to uncover failures.
Features
python3 $XPK inspector ...
prints out debug information related to kueue, jobset, workload, gke cluster , gke node pools, and cloud console. It exports all the info needed to debug workload failures to a file to debug with.--workload
flag when interested in why a specific workload is not working--print-to-terminal
arg to print details to console.Users can investigate the lawyer file and also share the lawyer file with experts to quickly allow investigations to uncover failures.
Testing / Documentation
Ran the following commands:
python3 $XPK inspector --cluster $CLUSTER_NAME --print-to-terminal --project $PROJECT --zone $ZONE --workload vbarr-4-slice-test7-29
python3 $XPK inspector --cluster $CLUSTER_NAME --print-to-terminal --project $PROJECT --zone $ZONE
python3 $XPK inspector --cluster $CLUSTER_NAME --project $PROJECT --zone $ZONE