dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU.
While testing TPU provisioning, I noticed that both on-demand and spot TPUs can be deleted right after a successful call to create the TPU. The server correctly fails the job with FAILED_TO_START_DUE_TO_NO_CAPACITY, so the job can be retried via `retry`. However, the retry is likely to try the same offers again, which makes retrying suboptimal.
Perhaps we can introduce a local cache of failed offers or randomize the order of offers (e.g. across regions/zones) to fix this.
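
To make the two ideas concrete, here is a minimal sketch that combines them. It is not based on dstack's actual code: `Offer`, `FailedOfferCache`, `order_offers`, and the TTL value are all hypothetical names and choices made up for illustration. Offers that recently failed are pushed to the end instead of being dropped, and offers with equal price are shuffled so that retries do not always start with the same region/zone.

```python
import random
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Offer:
    # Hypothetical stand-in for an instance offer; not dstack's real model.
    backend: str
    region: str
    instance_type: str
    price: float


class FailedOfferCache:
    """Remembers offers that recently failed, for a limited time (TTL)."""

    def __init__(self, ttl_seconds: float = 600.0) -> None:
        self._ttl = ttl_seconds
        self._failed_at: Dict[Offer, float] = {}

    def mark_failed(self, offer: Offer, now: Optional[float] = None) -> None:
        self._failed_at[offer] = time.monotonic() if now is None else now

    def is_failed(self, offer: Offer, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        failed_at = self._failed_at.get(offer)
        if failed_at is None:
            return False
        if now - failed_at >= self._ttl:
            # The entry expired: forget it so the offer is tried normally again.
            del self._failed_at[offer]
            return False
        return True


def _shuffled_by_price(offers: List[Offer], rng: random.Random) -> List[Offer]:
    # Shuffle first, then sort by price. Python's sort is stable, so offers
    # with equal price keep their shuffled (random) relative order.
    shuffled = list(offers)
    rng.shuffle(shuffled)
    return sorted(shuffled, key=lambda o: o.price)


def order_offers(
    offers: List[Offer],
    cache: FailedOfferCache,
    rng: Optional[random.Random] = None,
    now: Optional[float] = None,
) -> List[Offer]:
    """Put offers that have not recently failed first, cheapest first."""
    if rng is None:
        rng = random.Random()
    fresh: List[Offer] = []
    recently_failed: List[Offer] = []
    for offer in offers:
        (recently_failed if cache.is_failed(offer, now) else fresh).append(offer)
    return _shuffled_by_price(fresh, rng) + _shuffled_by_price(recently_failed, rng)
```

In this sketch the provisioning loop would call `cache.mark_failed(offer)` whenever an offer ends in a no-capacity failure, and call `order_offers(...)` before each attempt. Keeping recently failed offers at the end rather than removing them means a job can still start if they are the only offers left.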