AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

better core dump for debugging #184

Open ZhiyuLi-goog opened 2 months ago

ZhiyuLi-goog commented 2 months ago

Features

Upload more detailed info to the GCS bucket on failures.

Switch to gcloud storage cp, since it is around twice as fast as gsutil -m cp.
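
As a rough illustration of the switch described above, the sketch below builds a `gcloud storage cp` command for a recursive upload. It is not the code from this PR: the function names, the local directory, and the bucket path are hypothetical, and the upload only runs when `dry_run` is disabled.

```python
import shlex
import subprocess


def build_upload_command(local_dir: str, gcs_uri: str) -> list[str]:
    """Build a `gcloud storage cp` command that recursively copies local_dir.

    Hypothetical helper for illustration; not part of xpk.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {gcs_uri!r}")
    # The older equivalent would be: gsutil -m cp -r <local_dir> <gcs_uri>
    return ["gcloud", "storage", "cp", "--recursive", local_dir, gcs_uri]


def upload_debug_artifacts(local_dir: str, gcs_uri: str, dry_run: bool = True) -> str:
    """Return the command as a shell string; execute it unless dry_run is set."""
    cmd = build_upload_command(local_dir, gcs_uri)
    if not dry_run:
        # check=True raises CalledProcessError if the upload fails.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    # Placeholder paths; substitute a real dump directory and bucket.
    print(upload_debug_artifacts("/tmp/dumps", "gs://my-bucket/debug"))
```

Keeping command construction separate from execution makes the upload step easy to check without network access or credentials.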

Testing / Documentation