dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

[Bug]: Services stop responding after `dstack-gateway` reboot #1565

Open jvstme opened 2 months ago

jvstme commented 2 months ago

Steps to reproduce

  1. Create a gateway
  2. Run one or several services behind the gateway
  3. Reboot the instance the gateway is running on, e.g. via its cloud console

Actual behaviour

Previously created services no longer respond.

curl https://gateway.mygateway.example/chat/completions -H 'Authorization: Bearer *****' -H 'Content-Type: application/json' -d '{"model":"llama3.1", "messages": [{"role":"user", "content":"Hi"}]}'
{"error":"GatewayError","message":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.18.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n"}

Expected behaviour

Previously created services should continue responding, since rebooting the gateway's instance is a rare but possible circumstance.

dstack version

master

Server logs

No response

Additional information

dstack-gateway creates SSH tunnels to services and stores their control sockets in the /tmp directory, which does not survive machine reboots.
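As an illustration of the failure mode only, here is a minimal sketch of how stale tunnels could be detected after a reboot by checking whether each expected control socket still exists. The function name `missing_control_sockets` and the idea of keeping a list of expected socket paths are assumptions for this example, not dstack-gateway's actual code.

```python
import os
import stat
from pathlib import Path
from typing import Iterable, List


def missing_control_sockets(socket_paths: Iterable[str]) -> List[Path]:
    """Return the paths that are absent or are not Unix sockets.

    Hypothetical helper: after a reboot wipes /tmp, every previously
    recorded control socket path would be reported here, which means
    the corresponding SSH tunnel has to be re-established.
    """
    missing = []
    for raw in socket_paths:
        path = Path(raw)
        try:
            mode = os.stat(path).st_mode
        except FileNotFoundError:
            # The socket file is gone, e.g. /tmp was cleared on boot.
            missing.append(path)
            continue
        if not stat.S_ISSOCK(mode):
            # Something exists at the path, but it is not a socket.
            missing.append(path)
    return missing
```

A socket file that still exists does not prove the tunnel is alive; `ssh -S <socket> -O check <host>` asks the SSH master process directly and would be the stronger check.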

Stealthwriter commented 2 months ago

yea same issue

r4victor commented 1 month ago

Rebooting a gateway instance leaves its services unreachable. Marking this as major since there is no simple workaround.

un-def commented 1 month ago

Fleet instances won't survive a reboot either, at least on some backends (tested with gcp): they no longer have the dstack public keys after a reboot.

jvstme commented 3 weeks ago

Sometimes the dstack-gateway application won't start at all after an instance reboot. Recently a planned reboot led to an empty ~/dstack/state.json file, so dstack-gateway failed to restart and the gateway state was lost.

Oct 14 14:56:48 ip-172-31-30-166 sh[614992]: INFO:     127.0.0.1:36098 - "GET /api/stats/collect HTTP/1.1" 200 OK
Oct 14 14:56:52 ip-172-31-30-166 sh[614992]: INFO:     127.0.0.1:33052 - "GET /api/stats/collect HTTP/1.1" 200 OK
-- Boot 2744f08ca7ef4101be2445a849b165b3 --
Oct 14 14:57:24 ip-172-31-30-166 systemd[1]: Started dstack gateway service.
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: INFO:     Started server process [395]
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: INFO:     Waiting for application startup.
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: 2024-10-14 14:57:28,321 - dstack.gateway.core.persistent - DEBUG - Loading state from /home/ubuntu/dstack/state.json
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: ERROR:    Traceback (most recent call last):
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/starlette/routing.py", line 734, in lifespan
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     async with self.lifespan_context(app) as maybe_state:
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     return await anext(self.gen)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/main.py", line 24, in lifespan
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     store = get_store()
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/core/store.py", line 346, in get_store
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     store = Store.model_validate(get_persistent_state().get(Store.persistent_key, {}))
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/core/persistent.py", line 27, in get_persistent_state
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     state = json.load(f)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/__init__.py", line 293, in load
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     return loads(fp.read(),
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     return _default_decoder.decode(s)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     raise JSONDecodeError("Expecting value", s, err.value) from None
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: ERROR:    Application startup failed. Exiting.
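The traceback above shows `json.load` failing on an empty state file. A common way to make this less likely is to write the state to a temporary file, fsync it, and then atomically rename it over the real file, so a reboot mid-write leaves either the old or the new contents. The sketch below is an assumption about how this could be done, not the actual dstack-gateway implementation; `save_state_atomic` and `load_state` are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Union


def save_state_atomic(path: Union[str, Path], state: Dict[str, Any]) -> None:
    """Write state as JSON so that readers never see a partial file."""
    path = Path(path)
    # Create the temp file in the same directory so os.replace stays
    # on one filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_state(path: Union[str, Path]) -> Dict[str, Any]:
    """Load state, treating a missing or empty file as empty state."""
    path = Path(path)
    try:
        text = path.read_text()
    except FileNotFoundError:
        return {}
    if not text.strip():
        # An empty file would raise JSONDecodeError in json.loads.
        return {}
    return json.loads(text)
```

Returning an empty dict for an empty file lets the application start, but it hides the fact that state was lost, so logging a warning or keeping a backup copy of the previous file may be preferable in practice.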