dstackai / dstack

dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.
https://dstack.ai
Mozilla Public License 2.0

[Feature]: Shut off instances when they have finished a task and they don't hear from the server for some time #1047

Open spott opened 3 months ago

spott commented 3 months ago

Problem

Currently, only the server shuts down instances. If the server isn't running for whatever reason, an idle instance keeps running and racks up an expensive compute bill.

Solution

If an instance hasn't heard from the server for some set period of time (5 min? 10?), and is idle, then have it shut itself down.

Benefit

It saves money and benefits laptop users who only need the server to submit jobs during the day: at night they can close their laptop and let the jobs finish, and the instances shut themselves down overnight. It is also a nice failsafe against any potential bugs in the server.

Alternatives

Find some way to keep the server running?

Would you like to help contributing this feature?

Yes

peterschmidt85 commented 3 months ago

@spott What if the job is very important, and the user expects it to keep running even though the server is temporarily down?

spott commented 3 months ago

Sorry, the instance shouldn't stop the job, but after the job is done, if the instance hasn't heard from the server for 5 minutes and is idling, it should shut down.
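
The behavior described here can be sketched as a small check that an agent on the instance would run periodically. This is a minimal illustration, not dstack's actual code: `InstanceState`, `should_self_terminate`, and the 300-second default are hypothetical names and values chosen to match the 5-minute figure above.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical view of what an on-instance agent knows about itself."""
    job_running: bool
    last_server_contact: float  # timestamp (seconds) of the last request from the server


def should_self_terminate(state: InstanceState, now: float, timeout_s: float = 300.0) -> bool:
    """Return True only when the instance is idle AND the server has been silent too long."""
    if state.job_running:
        # Never interrupt a running job, even if the server is unreachable.
        return False
    # Idle instance: shut down once the server has been silent for the full timeout.
    return (now - state.last_server_contact) >= timeout_s
```

The key design point is that the running-job check comes first, so a temporarily unreachable server can never cause a job to be stopped; only an idle instance is eligible for self-termination.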

peterschmidt85 commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

spott commented 1 month ago

I'm still interested in this feature.

r4victor commented 1 month ago

@spott, dstack used to have this functionality: the instance would eventually shut itself down if the server failed to do so for some reason. We only recently finished removing it. There were several reasons for the decision, mainly:

So, realistically speaking, we're unlikely to resurrect this feature any time soon. That said, I totally see the need to mitigate the possibility of forgotten idle instances when you start the dstack server locally via CLI. For this case, I'd suggest a different feature that we could implement:

When you interrupt dstack server, it could check for idle instances, and if there are any, it would tell you that and ask you if you're willing to exit anyway. It would solve most of the problems with forgetting idle instances. What do you say?
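
A rough sketch of what such a check on exit could look like is below. It is only an illustration under assumptions: the `confirm_exit` function, the dictionary shape of an instance, and the `"idle"` status string are hypothetical and do not reflect dstack's internal API.

```python
from typing import Callable, Iterable, List


def confirm_exit(instances: Iterable[dict], ask: Callable[[str], str] = input) -> bool:
    """Return True if the server may exit; prompt the user when idle instances remain."""
    # Collect the names of instances that are provisioned but not running a job.
    idle: List[str] = [i["name"] for i in instances if i.get("status") == "idle"]
    if not idle:
        return True  # Nothing is left idling, so exiting is safe.
    answer = ask(
        f"{len(idle)} idle instance(s) still running: {', '.join(idle)}. Exit anyway? [y/N] "
    )
    # Default to staying up unless the user explicitly confirms.
    return answer.strip().lower() in ("y", "yes")
```

A server could call this from its interrupt handler and keep running when it returns `False`. The `ask` parameter exists only so the prompt can be replaced in tests.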

spott commented 1 month ago

> When you interrupt dstack server, it could check for idle instances

Unfortunately, this doesn't really do much. By default, an instance stays idle for only 5 minutes after a job has finished, so unless you shut down the server within those 5 minutes, you would always see no idle instances.

The issue is really that having a server that must always be online kills usability for all the laptop users out there, as they now have to keep their laptops awake so that instances get shut down when jobs are done (and if they aren't aware of that, it costs them money!). A lot of hobbyists who are using something like this don't want to provision a dstack server in the cloud. Thankfully dstack sky mitigates some of these issues for hobbyists, but that means any credits or money you have already put into GPU clouds is now forfeit.

I fully understand the issues that this causes, but I still think this would be valuable in reducing onboarding friction. I'm not alone either: SkyPilot has deemed this important enough to add it as an option (it's called auto-stop).

peterschmidt85 commented 1 month ago

Thank you, @spott, for the valuable feedback. Let us discuss this internally and come up with options.

r4victor commented 3 weeks ago

@spott, we've just released a new version of https://sky.dstack.ai that allows you to configure backends with your own cloud credentials just as you would do in dstack! It seems like it could be a solution to your problem.