dstackai / dstack

dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU.
https://dstack.ai/docs
Mozilla Public License 2.0
1.32k stars 98 forks

TPUs may be interrupted immediately after provisioning leading to suboptimal retry #1359

Open r4victor opened 2 months ago

r4victor commented 2 months ago

While testing TPU provisioning, I noticed that both on-demand and spot TPUs can be deleted right after a successful call to create the TPU. The server correctly fails the job with FAILED_TO_START_DUE_TO_NO_CAPACITY, so it can be retried when retry is enabled. But the retry is likely to try the same offers again, which makes it suboptimal.

Perhaps we can introduce a local cache of failed offers or randomize the order of offers (e.g. across regions/zones) to fix this.
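
A rough sketch of how the two ideas could be combined, to make the proposal concrete. Nothing here is dstack's actual code: `Offer`, `FailedOfferCache`, `order_offers_for_retry`, and the 10-minute TTL are all hypothetical names and values chosen for illustration. The idea is to remember offers that recently failed, push them to the back of the list on retry, and shuffle offers of equal price so that retries do not always hit the same zone first.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Offer:
    # Hypothetical stand-in for an offer; not dstack's real model.
    backend: str
    region: str
    instance_type: str
    price: float


class FailedOfferCache:
    """In-memory record of offers that recently failed, with a TTL."""

    def __init__(
        self,
        ttl_seconds: float = 600.0,  # assumed value, not from the issue
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._failed_at: Dict[Tuple[str, str, str], float] = {}

    @staticmethod
    def _key(offer: Offer) -> Tuple[str, str, str]:
        return (offer.backend, offer.region, offer.instance_type)

    def mark_failed(self, offer: Offer) -> None:
        self._failed_at[self._key(offer)] = self._clock()

    def is_recently_failed(self, offer: Offer) -> bool:
        key = self._key(offer)
        failed_at = self._failed_at.get(key)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._ttl:
            # The entry expired, so the offer is eligible again.
            del self._failed_at[key]
            return False
        return True


def order_offers_for_retry(
    offers: List[Offer],
    cache: FailedOfferCache,
    rng: Optional[random.Random] = None,
) -> List[Offer]:
    """Put recently failed offers last; randomize ties among equal prices."""
    rng = rng or random.Random()
    fresh: List[Offer] = []
    failed: List[Offer] = []
    for offer in offers:
        (failed if cache.is_recently_failed(offer) else fresh).append(offer)
    for group in (fresh, failed):
        # Shuffle first, then rely on the stable sort so that only
        # offers with the same price end up in random relative order.
        rng.shuffle(group)
        group.sort(key=lambda o: o.price)
    return fresh + failed
```

With something like this, the provisioning loop would call `cache.mark_failed(offer)` when a TPU disappears right after creation and pass the candidate offers through `order_offers_for_retry` on the next attempt. Failed offers are demoted rather than dropped, so a retry can still fall back to them if nothing else is available.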

peterschmidt85 commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.