dstackai / dstack

dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.
https://dstack.ai
Mozilla Public License 2.0

[Bug]: Run fails on TPU after 16 minutes with `INTERRUPTED_BY_NO_CAPACITY` #1397

Open yixiaoer opened 3 days ago

yixiaoer commented 3 days ago

Steps to reproduce

dstack run . -f train.dstack.yml -b gcp --gpu tpu-v2-8

Actual behaviour

The process ran normally at first, printing output from installing packages, downloading the dataset from Hugging Face, connecting to wandb, and mapping. However, after running for 16 minutes, it terminated without displaying any runtime error. Instead, it ended with the following message:

Run failed with error code INTERRUPTED_BY_NO_CAPACITY. Check CLI and server logs for more details.

Expected behaviour

The script should run to completion without interruption.

dstack version

0.18.4

Server logs

[02:54:32] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.027839s
           DEBUG    dstack._internal.server.background.tasks.process_running_jobs:234
                    job(d4ff61)short-starfish-1-0-0: process running job, age=0:16:13.164768
[02:54:33] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.025827s
[02:54:34] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.024967s
[02:54:35] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.025046s
[02:54:36] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.028162s
           DEBUG    dstack._internal.server.background.tasks.process_instances:554 Check
                    instance short-starfish-1-0-0 status. shim health: Service is OK
           DEBUG    dstack._internal.server.background.tasks.process_running_jobs:234
                    job(d4ff61)short-starfish-1-0-0: process running job, age=0:16:17.148749
           DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection refused\r\n'
[02:54:37] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.024913s
           DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection refused\r\n'
[02:54:38] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.024718s
           DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection refused\r\n'
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:246
                    job(d4ff61)short-starfish-1-0-0: failed because runner is not available
                    or return an error,  age=0:16:19.212862
           INFO     dstack._internal.server.background.tasks.process_runs:333
                    run(4b7414)short-starfish-1: run status has changed RUNNING ->
                    TERMINATING
[02:54:39] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.025353s
[02:54:40] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.027115s
           DEBUG    dstack._internal.server.services.jobs:224
                    job(d4ff61)short-starfish-1-0-0: stopping container
           DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection refused\r\n'
[02:54:41] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.023704s
           DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection refused\r\n'
[02:54:42] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.023479s
           DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection refused\r\n'
           INFO     dstack._internal.server.services.jobs:248
                    job(d4ff61)short-starfish-1-0-0: instance 'short-starfish-1-0-0' has been
                    released, new status is IDLE
           INFO     dstack._internal.server.services.jobs:265
                    job(d4ff61)short-starfish-1-0-0: job status is FAILED, reason:
                    INTERRUPTED_BY_NO_CAPACITY
           INFO     dstack._internal.server.services.runs:821 run(4b7414)short-starfish-1:
                    run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
[02:54:43] DEBUG    dstack._internal.server.app:181 Processed request POST
                    http://127.0.0.1:3000/api/project/main/runs/get in 0.024886s
[02:54:55] DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection timed out\r\n'
           DEBUG    dstack._internal.server.background.tasks.process_instances:554 Check
                    instance short-starfish-1-0-0 status. shim health: SSH or tunnel error
           WARNING  dstack._internal.server.background.tasks.process_instances:601 Instance
                    short-starfish-1-0-0 shim is not available
[02:55:05] DEBUG    dstack._internal.core.services.ssh.tunnel:63 SSH tunnel failed: b'ssh:
                    connect to host 34.42.95.182 port 22: Connection timed out\r\n'
           DEBUG    dstack._internal.server.background.tasks.process_instances:554 Check
                    instance short-starfish-1-0-0 status. shim health: SSH or tunnel error
           WARNING  dstack._internal.server.background.tasks.process_instances:601 Instance
                    short-starfish-1-0-0 shim is not available

Additional information

No response

r4victor commented 2 days ago

@yixiaoer, are you running a spot or an on-demand TPU? INTERRUPTED_BY_NO_CAPACITY means the instance became unavailable. This usually happens with spot instances when the cloud provider interrupts them.

If the task can handle interruptions, you can specify a retry policy to resubmit the run when it gets interrupted.
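
For illustration, a retry policy in the run configuration might look like the sketch below. The `retry` block with `on_events` and `duration` reflects recent 0.18.x releases as far as I recall; older releases used a different key (`retry_policy`), so treat the exact key names and the `1h` value as assumptions and check the reference for your installed version.

```yaml
# Fragment to add to train.dstack.yml (sketch; key names assumed for 0.18.x)
retry:
  # Resubmit the run when the instance is interrupted by the provider
  on_events: [interruption]
  # Assumed example value: stop retrying after one hour
  duration: 1h
```

Note that a retry only resubmits the run; the training script itself still has to resume from a checkpoint to make use of it.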

If the task cannot handle interruptions, consider using on-demand instances: pass --on-demand to dstack run.

yixiaoer commented 2 days ago

I was running on a spot TPU. After I enabled retry on interruption, the run was retried but later lost the connection again. I also tried the --on-demand option with dstack run, but the problem persists.

Could this be related to the TPU memory capacity? The downloaded dataset is quite large (approximately 22GB, downloaded to /dev/shm). However, the code itself reported no errors. I also specified the following in .dstack.yml:

resources:
  memory: 100GB
  shm_size: 50GB

Is this the correct way to specify resources for TPUs? Given the situation, is there anything else I can do to resolve this issue?

peterschmidt85 commented 2 days ago

@yixiaoer, it's quite strange that the problem persists with --on-demand. Could you please double-check it? Also, please share the output of dstack ps once you try it, so we can see whether it used spot or not.

Also, to ensure it doesn't use spot instances, you can set spot_policy to on-demand in the YAML.
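
As a sketch, the setting could sit next to the resources already quoted in this thread. Only `spot_policy: on-demand` is new here; the rest of train.dstack.yml has not been shared, so its placement within the file is an assumption.

```yaml
# Fragment of train.dstack.yml (sketch; the rest of the file is not shown in this thread)
# Request on-demand instances only, never spot
spot_policy: on-demand

resources:
  memory: 100GB
  shm_size: 50GB
```

The output of dstack ps should then show whether the run was actually provisioned as on-demand.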

Please let me know if you can check it.

Yes, the resources look OK to me!

Also, in case it doesn't work again, could you please share the repo with train.dstack.yml and scripts so we can try to reproduce it?