Open spott opened 3 months ago
@spott What if the job is very important, and the user expects it to keep running even though the server is temporarily down?
Sorry, the instance shouldn't stop the job, but after the job is done, if the instance hasn't heard from the server for 5 minutes and is idling, it should shut down.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I'm still interested in this feature.
@spott, dstack used to have this functionality: the instance would eventually shut itself down if the server failed to do so for some reason. We only recently finished removing it. There were several reasons for the decision, mainly:
So, realistically speaking, we're unlikely to resurrect this feature any time soon. That said, I totally see the need to mitigate the possibility of forgotten idle instances when you start the dstack server locally via CLI. For this case, I'd suggest a different feature that we could implement:
When you interrupt dstack server, it could check for idle instances, and if there are any, it would tell you that and ask you if you're willing to exit anyway. It would solve most of the problems with forgetting idle instances. What do you say?
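To make the proposal concrete, here is a minimal sketch of the confirmation step, not dstack's actual code. The function name `confirm_exit`, the list of instance names, and the prompt wording are all hypothetical; how the server would discover idle instances is left out.

```python
from typing import Callable, Sequence


def confirm_exit(
    idle_instances: Sequence[str],
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True if the server may exit.

    Hypothetical helper: `idle_instances` is assumed to be a list of
    instance names that the server already knows are idle.
    """
    # Nothing is idle, so there is nothing to warn about.
    if not idle_instances:
        return True
    names = ", ".join(idle_instances)
    answer = ask(
        f"{len(idle_instances)} idle instance(s) still running: {names}. "
        "Exit anyway? [y/N] "
    )
    # Default to staying alive unless the user explicitly agrees.
    return answer.strip().lower() in ("y", "yes")
```

A server's interrupt handler could call `confirm_exit(...)` and keep running when it returns `False`. Defaulting to "no" means that pressing Enter by accident does not leave instances behind.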
> When you interrupt dstack server, it could check for idle instances
Unfortunately, this doesn't really do much. By default, an instance is only idle for 5 minutes after a job has finished, so unless you shut down the server within those 5 minutes, you would never see any idle instances.
The issue is really that having a server that must always be online kills usability for all the laptop users out there, as they now have to keep their laptop awake in order to kill jobs when they are done (and if they aren't aware of that, it costs them money!). A lot of hobbyists who are using something like this don't want to provision a dstack server in the cloud. Thankfully dstack sky mitigates some of these issues for hobbyists, but that means any credits or money you have already put into GPU clouds is now forfeit.
I fully understand the issues that this causes, but I still think this would be valuable in reducing onboarding friction. I'm not alone either: SkyPilot has deemed this important enough to add the option (it's called autostop).
Thank you, @spott, for the valuable feedback. Let us discuss this internally and come up with options.
@spott, we've just released a new version of https://sky.dstack.ai that allows you to configure backends with your own cloud credentials just as you would do in dstack! It seems like it could be a solution to your problem.
Problem
Currently, the server is what shuts down an instance. So if the server isn't running for whatever reason, an instance will sit idle, running up an expensive compute bill.
Solution
If an instance hasn't heard from the server for some set period of time (5 min? 10?), and is idle, then have it shut itself down.
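The rule above could look roughly like the following sketch. It is only an illustration under assumptions: `WatchdogState`, `should_shut_down`, `watch`, and the 300-second default are hypothetical names and values, not part of dstack, and the actual shutdown action is passed in as a callback.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class WatchdogState:
    last_server_contact: float  # clock reading when the server last contacted us
    job_running: bool           # True while a job is still executing


def should_shut_down(state: WatchdogState, now: float, timeout_s: float = 300.0) -> bool:
    """True only if the instance is idle AND the server has been silent for timeout_s."""
    # Never interrupt a running job, no matter how long the server is gone.
    if state.job_running:
        return False
    return (now - state.last_server_contact) >= timeout_s


def watch(
    get_state: Callable[[], WatchdogState],
    shut_down: Callable[[], None],
    timeout_s: float = 300.0,
    poll_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the shutdown condition holds, then call shut_down() once."""
    while True:
        if should_shut_down(get_state(), clock(), timeout_s):
            shut_down()
            return
        sleep(poll_s)
```

Using a monotonic clock avoids false triggers when the wall clock jumps. Because the check requires the instance to be idle, a job that outlives the server's connection still runs to completion before the timeout can take effect.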
Benefit
This saves money. It benefits laptop users who only need the server to submit jobs during the day: at night they can close their laptop and let the jobs finish and the instances shut themselves down. It is also a nice failsafe against potential bugs in the server.
Alternatives
Find some way to keep the server running?
Would you like to help contributing this feature?
Yes