AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k,) is a software tool to help Cloud developers to orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
81 stars 23 forks source link

Export EXIT_CODE from the user provided command, and propagate the error to Cloud Console UI #73

Closed Obliviour closed 8 months ago

Obliviour commented 8 months ago

Fixes / Features

Testing / Documentation

Is Happy workload successful with EXIT_CODE=0

python3 xpk.py workload create --cluster vbarr-v4-xpk-test-tpu --workload vbarr-xpk-error-test-2 --command "bash test.sh" --tpu-type=v4-8 --num-slices=1 --zone=us-central2-b --project=cloud-tpu-multipod-dev image

Error workload with EXITO_CODE=1

python3 xpk.py workload create --cluster vbarr-v4-xpk-test-tpu --workload vbarr-xpk-error-test-fail --command "bash test.sh && ech a" --tpu-type=v4-8 --num-slices=1 --zone=us-central2-b --project=cloud-tpu-multipod-dev image

Debug flag work?

XLA Flag work?

python3 xpk.py workload create --cluster vbarr-v4-xpk-test-tpu --workload vbarr-xpk-error-test-fail --command "bash test.sh && ech a" --debug-dump-gcs=some_bucket --enable-debug-logs --tpu-type=v4-8 --num-slices=1 --zone=us-central2-b --project=cloud-tpu-multipod-dev image

Does the error propagate to pantheon UI?

image