Open jvstme opened 2 months ago
yea same issue
A rebooted gateway instance leads to services not working. Marking it as major since there is no simple workaround.
Fleet instances won't survive reboot either, at least on some backends (tested with gcp): they don't have dstack public keys after reboot.
Sometimes the dstack-gateway application won't start at all after instance reboot. Recently a planned reboot led to an empty ~/dstack/state.json file, so dstack-gateway failed to restart and gateway state was lost.
Oct 14 14:56:48 ip-172-31-30-166 sh[614992]: INFO: 127.0.0.1:36098 - "GET /api/stats/collect HTTP/1.1" 200 OK
Oct 14 14:56:52 ip-172-31-30-166 sh[614992]: INFO: 127.0.0.1:33052 - "GET /api/stats/collect HTTP/1.1" 200 OK
-- Boot 2744f08ca7ef4101be2445a849b165b3 --
Oct 14 14:57:24 ip-172-31-30-166 systemd[1]: Started dstack gateway service.
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: INFO: Started server process [395]
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: INFO: Waiting for application startup.
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: 2024-10-14 14:57:28,321 - dstack.gateway.core.persistent - DEBUG - Loading state from /home/ubuntu/dstack/state.json
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: ERROR: Traceback (most recent call last):
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/starlette/routing.py", line 734, in lifespan
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: async with self.lifespan_context(app) as maybe_state:
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: return await anext(self.gen)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/main.py", line 24, in lifespan
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: store = get_store()
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/core/store.py", line 346, in get_store
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: store = Store.model_validate(get_persistent_state().get(Store.persistent_key, {}))
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/core/persistent.py", line 27, in get_persistent_state
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: state = json.load(f)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/usr/lib/python3.10/json/__init__.py", line 293, in load
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: return loads(fp.read(),
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: return _default_decoder.decode(s)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: raise JSONDecodeError("Expecting value", s, err.value) from None
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: ERROR: Application startup failed. Exiting.
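The traceback above shows `json.load` failing on an empty state file. One common way to avoid ending up with a truncated file is to write to a temporary file, fsync it, and then rename it over the real path. The sketch below illustrates that pattern; `save_state` and `load_state` are hypothetical names, not the actual functions in `dstack/gateway/core/persistent.py`, and whether the gateway's current write path is the cause of the empty file is an assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file, fsync it, then atomically replace path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(path: Path) -> dict:
    """Return the stored state, or {} if the file is missing, empty, or invalid JSON."""
    try:
        text = path.read_text()
    except FileNotFoundError:
        return {}
    if not text.strip():
        return {}
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {}
```

Falling back to `{}` only lets the application start; it does not recover lost state. The atomic write is the part intended to keep the file from being left empty by an interrupted write.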
Steps to reproduce
Actual behaviour
Previously created services no longer respond.
Expected behaviour
Previously created services should continue responding, since rebooting the gateway's instance is a rare but possible circumstance.
dstack version
master
Server logs
No response
Additional information
dstack-gateway creates SSH tunnels to services and stores control sockets in the /tmp directory, which does not survive machine reboots.
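One possible mitigation, sketched under assumptions: on startup, check which expected control sockets are gone so the corresponding tunnels can be re-created. `find_missing_sockets` is a hypothetical helper, and it assumes the gateway has a list of the socket paths it expects; how tunnels are actually re-opened is left out.

```python
import os
import stat
from typing import Iterable, List


def find_missing_sockets(socket_paths: Iterable[str]) -> List[str]:
    """Return the paths that no longer point to a Unix socket."""
    missing = []
    for path in socket_paths:
        try:
            mode = os.stat(path).st_mode
        except FileNotFoundError:
            # Typical after a reboot if /tmp was cleared.
            missing.append(path)
            continue
        if not stat.S_ISSOCK(mode):
            # Something exists at the path, but it is not a socket.
            missing.append(path)
    return missing
```

Any path returned here would need its SSH tunnel re-established before the service can respond again.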