AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k,) is a software tool to help Cloud developers to orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
81 stars 23 forks source link

Add xpk inspector #80

Closed Obliviour closed 8 months ago

Obliviour commented 8 months ago

Features

python3 $XPK inspector ... prints out debug information related to kueue, jobset, workload, gke cluster , gke node pools, and cloud console. It exports all the info needed to debug workload failures to a file to debug with.

Users can investigate the lawyer file and also share the lawyer file with experts to quickly allow investigations to uncover failures.

Testing / Documentation

Ran the following commands:

python3 $XPK inspector --cluster $CLUSTER_NAME --print-to-terminal --project $PROJECT --zone $ZONE --workload vbarr-4-slice-test7-29

python3 $XPK inspector --cluster $CLUSTER_NAME --print-to-terminal --project $PROJECT --zone $ZONE

python3 $XPK inspector --cluster $CLUSTER_NAME --project $PROJECT --zone $ZONE

SurbhiJainUSC commented 8 months ago

Looks good to me. Minor fix in PR description: Replace print-to-console with print-to-terminal.