xpk (Accelerated Processing Kit, pronounced x-p-k,) is a software tool to help Cloud developers to orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
81
stars
23
forks
source link
Add flag to restart-on-user-failures, otherwise do not #99
Fixes / Features
--restart-on-user-failures
Testing / Documentation
Test that workload did not restart on user failure