AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k,) is a software tool to help Cloud developers to orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
81 stars 23 forks source link

Add flag to restart-on-user-failures, otherwise do not #99

Closed Obliviour closed 7 months ago

Obliviour commented 7 months ago

Fixes / Features

Testing / Documentation

Test that workload did not restart on user failure


# Doesnt restart the job
echo -e '#!/bin/bash \n echo "Hello world from a test script!"; exit 1' > test_for_user_error.sh
python3 xpk.py workload create --cluster C   --workload vbarr-xpk-expect-error-1 --command "bash test_for_user_error.sh"   --tpu-type=v4-8 --num-slices=2 --zone=us-central2-b --project=cloud-tpu-multipod-dev

# Does restart the job 10 times
echo -e '#!/bin/bash \n echo "Hello world from a test script!"; exit 1' > test_for_user_error.sh
python3 xpk.py workload create --clusterC   --workload vbarr-xpk-expect-error-retry-1 --command "bash test_for_user_error.sh"   --tpu-type=v4-8 --num-slices=2 --zone=us-central2-b --project=cloud-tpu-multipod-dev --restart-on-user-code-failure --max-restarts 10

# Doesn't restart the job
echo -e '#!/bin/bash \n echo "Hello world from a test script!"; exit 1' > test_for_user_error.sh
python3 xpk.py workload create --cluster C   --workload vbarr-xpk-expect-error-no-retry-1 --command "bash test_for_user_error.sh"   --tpu-type=v4-8 --num-slices=2 --zone=us-central2-b --project=cloud-tpu-multipod-dev --restart-on-user-code-failure --max-restarts 10