dstack is an open-source alternative to Kubernetes, designed to simplify the development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA and AMD GPUs as well as TPUs.
Steps to reproduce
Start a run on a container-based backend such as RunPod. Then wait until the run is running and switch off the network on dstack-server's host.
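One way to simulate the outage (an assumption on my side, not part of the original report; any method of cutting the server host's egress works, and the interface name is just a placeholder):

```python
import subprocess

# Take the server host's network interface down to simulate the outage.
# Requires root; "eth0" is a placeholder for the actual interface name.
subprocess.run(["ip", "link", "set", "eth0", "down"], check=True)
```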
Actual behaviour
The run is marked failed and the instance is marked terminated. However, the instance still exists in RunPod, and the user continues to be billed for it.
Expected behaviour
The instance is not marked terminated until it is actually deleted in RunPod.
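For illustration, a minimal sketch of that retry-until-deleted flow (`backend.terminate_instance` and `mark_terminated` are hypothetical stand-ins, not dstack's actual internals):

```python
import time

import requests

# A minimal sketch of the expected flow: only mark the instance terminated
# once the backend confirms the deletion; on connectivity errors, retry.
def terminate_with_retries(backend, instance_id, mark_terminated,
                           base_delay=10.0, max_delay=300.0):
    attempt = 0
    while True:
        attempt += 1
        try:
            backend.terminate_instance(instance_id)
        except requests.exceptions.ConnectionError:
            # The delete request never reached the cloud API (e.g.
            # "Temporary failure in name resolution"), so the instance may
            # still be running and billing. Stay in TERMINATING and retry.
            time.sleep(min(base_delay * attempt, max_delay))
            continue
        # The backend confirmed the deletion; now it is safe to mark the
        # instance terminated.
        mark_terminated(instance_id)
        return
```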
dstack version
master
Server logs
[09:21:18] DEBUG dstack._internal.core.services.ssh.tunnel:73 SSH tunnel failed: b'ssh: connect to host 194.68.245.18 port 22056: Network is unreachable\r\n'
I0000 00:00:1723620079.263387 1605744 work_stealing_thread_pool.cc:320] WorkStealingThreadPoolImpl::PrepareFork
[09:21:19] DEBUG dstack._internal.core.services.ssh.tunnel:73 SSH tunnel failed: b'ssh: connect to host 194.68.245.18 port 22056: Network is unreachable\r\n'
WARNING dstack._internal.server.background.tasks.process_running_jobs:259 job(e3ec13)polite-starfish-1-0-0: failed because runner is not available or return an error, age=0:03:00.121137
INFO dstack._internal.server.background.tasks.process_runs:338 run(5dd434)polite-starfish-1: run status has changed RUNNING -> TERMINATING
[09:21:21] DEBUG dstack._internal.server.services.jobs:238 job(e3ec13)polite-starfish-1-0-0: stopping container
INFO dstack._internal.server.services.jobs:269 job(e3ec13)polite-starfish-1-0-0: instance 'polite-starfish-1-0' has been released, new status is TERMINATING
INFO dstack._internal.server.services.jobs:286 job(e3ec13)polite-starfish-1-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY
[09:21:22] INFO dstack._internal.server.services.runs:932 run(5dd434)polite-starfish-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
[09:21:23] ERROR dstack._internal.server.background.tasks.process_instances:763 Got exception when terminating instance polite-starfish-1-0
Traceback (most recent call last):
[... long stack trace ...]
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.runpod.io', port=443): Max retries exceeded with url: /graphql?api_key=***** (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fb5a98a90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
INFO dstack._internal.server.background.tasks.process_instances:773 Instance polite-starfish-1-0 terminated
Additional information
I reproduced this issue on RunPod and Vast.ai but not on OCI. Maybe the behavior differs between container-based and VM-based backends. On OCI, dstack makes many attempts to delete the instance and only marks it terminated after succeeding, which is the expected behavior.
Ideally, the job also should not be marked failed if the connectivity issues are on dstack-server's side rather than on the instance's side. But this condition is difficult to detect, so it is out of scope for this issue.
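For completeness, one conceivable heuristic for that out-of-scope idea (purely a sketch; `server_has_connectivity`, the probe targets, and the timeout are my assumptions, not anything dstack implements): before attributing a runner failure to the instance, check whether dstack-server itself has outbound connectivity.

```python
import socket

# Hypothetical heuristic: probe a couple of well-known endpoints to tell
# whether dstack-server itself has outbound connectivity.
def server_has_connectivity(probes=(("1.1.1.1", 443), ("8.8.8.8", 443)),
                            timeout=3.0):
    for host, port in probes:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False
```

If this returned False while the runner is unreachable, the failure would more likely be on the server's side, and the job arguably should not be marked failed.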