xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
Export EXIT_CODE from the user-provided command and propagate the error to the Cloud Console UI #73
Fixes / Features
Testing / Documentation
Does the happy-path workload succeed with EXIT_CODE=0?
python3 xpk.py workload create --cluster vbarr-v4-xpk-test-tpu --workload vbarr-xpk-error-test-2 --command "bash test.sh" --tpu-type=v4-8 --num-slices=1 --zone=us-central2-b --project=cloud-tpu-multipod-dev
Does the error workload fail with EXIT_CODE=1?
python3 xpk.py workload create --cluster vbarr-v4-xpk-test-tpu --workload vbarr-xpk-error-test-fail --command "bash test.sh && ech a" --tpu-type=v4-8 --num-slices=1 --zone=us-central2-b --project=cloud-tpu-multipod-dev
Does the debug flag work?
Does the XLA flag work?
python3 xpk.py workload create --cluster vbarr-v4-xpk-test-tpu --workload vbarr-xpk-error-test-fail --command "bash test.sh && ech a" --debug-dump-gcs=some_bucket --enable-debug-logs --tpu-type=v4-8 --num-slices=1 --zone=us-central2-b --project=cloud-tpu-multipod-dev
Does the error propagate to the Pantheon UI?
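For reviewers who want to reproduce the two exit-code checks locally, the following is a minimal sketch of the general idea: run the user-provided command, capture its exit code, and exit with that same code so the caller sees the failure. It is not the xpk implementation; the helper name `run_and_capture_exit_code` is hypothetical, and the sketch assumes `bash` is available on the local machine.

```python
import subprocess
import sys


def run_and_capture_exit_code(user_command: str) -> int:
    """Run a user-provided command under bash and return its exit code.

    Hypothetical helper for illustration only; not part of xpk.
    """
    # check=False so a non-zero exit code is returned instead of raising.
    completed = subprocess.run(["bash", "-c", user_command], check=False)
    return completed.returncode


if __name__ == "__main__":
    # Happy path: the command succeeds, so the exit code is 0.
    happy_code = run_and_capture_exit_code("true")
    print(f"EXIT_CODE for happy command: {happy_code}")

    # Error path: `ech` is not a real command (same trick as the test
    # workload above), so the chained command exits non-zero.
    error_code = run_and_capture_exit_code("true && ech a")
    print(f"EXIT_CODE for failing command: {error_code}")

    # Propagate the failure to whoever launched this script.
    sys.exit(error_code)
```

Because the script ends with `sys.exit(error_code)`, running it and then checking `echo $?` in the shell should show a non-zero value, which is the behavior the error workload is meant to exercise.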