dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

[Bug] Sometimes the `azure` backend hangs for very long on instance creation #1353

Closed (peterschmidt85 closed this issue 1 month ago)

peterschmidt85 commented 4 months ago

Steps to reproduce:

  1. Run an instance using the azure backend

Reproduces only intermittently (possibly when there is no capacity, but that is unconfirmed).

Actual behavior:

  1. The run hangs as submitted
  2. The last server log is
{
    "message": "Requesting Standard_NV6ads_A10_v5 spot instance in westeurope...",
    "logger": "dstack._internal.core.backends.azure.compute",
    "timestamp": "2024-06-24 13:16:29,022",
    "level": "INFO"
}
  3. Stopping the run doesn't help

Notes:

The impact of this issue is unclear and yet to be confirmed. It certainly blocks the affected user; it is unknown whether other users are also blocked.

r4victor commented 4 months ago

AzureCompute.create_instance() hangs while waiting for the VM to be created here:

https://github.com/dstackai/dstack/blob/4a7a69127ff17727a15f7c6eff99b5940f9245e2/src/dstack/_internal/core/backends/azure/compute.py#L455

On the Azure side, the VM is stuck in the Creating state, which is why create_instance() never returns.

This should be fixed by setting a timeout on poller.result().

More generally, we need to ensure that all requests to cloud providers have timeouts set.
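
For illustration, here is a minimal sketch of bounding the wait on a long-running-operation poller. It does not use the Azure SDK: `FakePoller`, `ProvisioningTimeoutError`, and `result_with_timeout` are hypothetical names, and the stand-in only mimics a `wait(timeout)` / `done()` / `result()` interface. One assumption worth verifying against the azure-core docs: a poller's `result(timeout=...)` may simply stop waiting when the timeout elapses rather than raise, so the sketch checks `done()` explicitly before reading the result.

```python
import threading
import time


class FakePoller:
    """Hypothetical stand-in for a long-running-operation poller.

    It only mimics wait(timeout) / done() / result(); it is not the Azure SDK class.
    """

    def __init__(self, duration: float, value: str) -> None:
        self._value = value
        # The background thread simulates the cloud-side operation.
        self._thread = threading.Thread(
            target=time.sleep, args=(duration,), daemon=True
        )
        self._thread.start()

    def wait(self, timeout=None) -> None:
        # Returns silently when the timeout elapses, finished or not.
        self._thread.join(timeout=timeout)

    def done(self) -> bool:
        return not self._thread.is_alive()

    def result(self, timeout=None) -> str:
        self.wait(timeout)
        return self._value


class ProvisioningTimeoutError(Exception):
    """Raised when the operation does not finish within the allowed time."""


def result_with_timeout(poller, timeout: float):
    """Wait at most `timeout` seconds, then fail loudly instead of hanging."""
    poller.wait(timeout)
    if not poller.done():
        raise ProvisioningTimeoutError(
            f"operation still running after {timeout} seconds"
        )
    return poller.result()
```

With this shape, a VM stuck in the Creating state would surface as an error that the caller can handle (for example by cleaning up and trying another offer) instead of blocking `create_instance()` indefinitely.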

r4victor commented 4 months ago

Also, consider updating job processing tasks so that the server can process more than one job/run in parallel to prevent one stuck job from blocking the processing of other jobs.
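
A rough sketch of that idea, assuming the processing loop is asyncio-based: each job is handled in its own task with its own timeout, so a stuck job cannot hold up the others. The names `provision`, `process_job`, `process_jobs`, and `run_demo` are hypothetical and do not correspond to the dstack server's actual task code.

```python
import asyncio


async def provision(job: str, delay: float) -> str:
    # Simulated provisioning; `delay` stands in for the cloud response time.
    await asyncio.sleep(delay)
    return f"{job}: provisioned"


async def process_job(job: str, delay: float, timeout: float) -> str:
    # Each job gets its own deadline, so a stuck one fails on its own.
    try:
        return await asyncio.wait_for(provision(job, delay), timeout=timeout)
    except asyncio.TimeoutError:
        return f"{job}: timed out"


async def process_jobs(jobs: dict, timeout: float) -> list:
    # Run all jobs concurrently; gather() returns results in input order.
    tasks = [process_job(name, delay, timeout) for name, delay in jobs.items()]
    return await asyncio.gather(*tasks)


def run_demo() -> list:
    jobs = {"job-a": 0.01, "job-stuck": 10.0, "job-b": 0.02}
    return asyncio.run(process_jobs(jobs, timeout=0.2))
```

In this demo the stuck job times out after 0.2 seconds while the other two complete normally. Note that a blocking SDK call would not yield to the event loop the way `asyncio.sleep` does, so in practice it would likely need to run in a worker thread (for example via `asyncio.to_thread`) for this pattern to help.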

peterschmidt85 commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.