akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

Feature request: implement client-defined deployment grace period #144

Open andy108369 opened 1 year ago

andy108369 commented 1 year ago

I think the alternative (client-defined) proposal would be ideal: the timeout (how long a lease may stay down because it cannot be redeployed while the worker node is down) could be configured by the clients themselves in their SDL (say deployment_grace_period or tolerate_downtime in deployment manifests).

This way the clients could specify the amount of time (in hours/days) they can tolerate their app being down.

It could also be useful when providers have been running a deployment with persistent storage and the client isn't willing to lose it and can accept downtime measured even in days, just to get their data back. (I know they should back up the data and monitor the backups, but many don't, and backups can go stale, break, or get corrupted.)
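
A minimal sketch of how a client-defined grace period might be evaluated, assuming the value is given as a string in hours or days (for example `12h` or `3d`). The string format and the helper names `parse_grace_period` and `should_close_lease` are hypothetical; nothing here exists in the Akash SDL or the provider code.

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit suffixes for a client-defined grace period.
# The SDL has no such field today; this only illustrates the proposal.
_UNITS = {"h": "hours", "d": "days"}


def parse_grace_period(value: str) -> timedelta:
    """Parse a string such as '12h' or '3d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported grace period: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def should_close_lease(down_since: datetime, now: datetime, grace: timedelta) -> bool:
    """Return True once the deployment has been down longer than the grace period."""
    return now - down_since > grace
```

With a grace period of `1h`, an outage of 30 minutes would leave the lease open, while an outage of two hours would close it.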

SGC41 commented 1 year ago

I think the default for such a setting should also be pretty high... the only good reason to kill a deployment over some downtime would be in some sort of HA setup where replacements will have taken over the work anyway.

All these fast lease closures give tenants grief, especially in the case of persistent storage, where they could have been working on their deployment for weeks, making potentially massive changes which might only exist in the persistent storage.

which then is destroyed after just 10 or 30 minutes of issues from the provider side.

vpavlin commented 6 months ago

I am not 100% sure this is the same case, but I'll put it here just in case it is:)

I had a deployment which takes ~20mins to start (syncing some on-chain data) and then I needed to update the image. I screwed up the image format, so the deployment failed. I then fixed the image format, but it seemed like the scheduler/node did not pick up the fix (probably due to backoff) before the lease got automatically closed.

For this case I could definitely see myself setting this grace period to an hour if that would mean I'd not have to start from scratch in case of a small mess-up:)

andy108369 commented 5 days ago

FWIW: The provider-defined deployment grace period has been adjustable since provider v0.6.4: https://github.com/akash-network/provider/commit/99cb9ac7912e6ef5eafb7403d4ee767da2f368a7

I'll update the helm charts to support this in https://github.com/akash-network/helm-charts/issues/289