akash-network / support

Akash Support and Issue Tracking

provider will remove the leases during the k8s maintenance #14

Closed andy108369 closed 8 months ago

andy108369 commented 1 year ago

The provider has a setting monitorMaxRetries which is set to 40 (https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L26), so when one of the worker nodes gets **removed**\* during maintenance (say, to downsize the k8s cluster), the pods which do not have enough room to start will get closed.

\* I haven't tested with stopping the k8s worker node, but I presume the result is going to be the same.

The best thing to do during maintenance is to stop the akash-provider service in order to prevent the attempts counter from incrementing. Once it reaches 40, the provider believes the deployment has failed and closes the lease (in that case you will spot the message deployment failed. closing lease in the provider's logs, along with the accumulated attempts value): https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L162
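To check whether the provider has already started closing leases for this reason, you can grep its logs for that message. A minimal sketch, assuming the provider pod carries the app=akash-provider label used later in this issue:

```
# Search recent provider logs for the lease-closing message from monitor.go
kubectl -n akash-services logs -l app=akash-provider --tail=10000 \
  | grep "deployment failed. closing lease"
```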

To stop the akash-provider service

```
kubectl -n akash-services get statefulsets
kubectl -n akash-services scale statefulsets akash-provider --replicas=0
```

Verify

Make sure you see 0/0, or that no akash-provider pods are running:

```
kubectl -n akash-services get statefulsets
kubectl -n akash-services get pods -l app=akash-provider
```

This means akash-provider is now stopped.
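The statefulset output should look roughly like this (the age column will of course differ):

```
NAME             READY   AGE
akash-provider   0/0     45d
```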

To start it back again

Do this only when the maintenance is fully complete and all k8s nodes are back up with enough resources to host the pods.

```
kubectl -n akash-services scale statefulsets akash-provider --replicas=1
```
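You can then confirm the provider pod has come back up:

```
# Wait for the statefulset to report the pod as ready again
kubectl -n akash-services rollout status statefulset/akash-provider
kubectl -n akash-services get pods -l app=akash-provider
```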

Doc https://docs.akash.network/providers/akash-provider-troubleshooting/provider-maintenance

andy108369 commented 1 year ago

Alternatively, the monitorMaxRetries logic should not be applied to deployments which have already been deployed successfully at least once. OTOH, we don't want forever-Pending pods...

I have informed providers here: https://discord.com/channels/747885925232672829/771909807359262751/1061981326246424596

andy108369 commented 1 year ago

I have just discussed this internally. I'm going to make a proposal where the provider can set a grace period ranging from a few hours to days, so that in the event of a worker node crash the apps have a higher chance of starting back up again. This is especially important when the apps have persistent storage mounted, so clients can regain their data once the worker node is restored.
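Purely as an illustration of the proposal (this is not an existing provider option), such a grace period could be exposed as an environment variable on the provider statefulset. The variable name AP_DEPLOYMENT_GRACE_PERIOD below is hypothetical:

```
# Hypothetical: set a 24h grace period on the provider statefulset.
# AP_DEPLOYMENT_GRACE_PERIOD is an illustrative name, not a real provider flag.
kubectl -n akash-services set env statefulset/akash-provider \
  AP_DEPLOYMENT_GRACE_PERIOD=24h
```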

andy108369 commented 1 year ago

The proposal => https://github.com/orgs/akash-network/discussions/160

troian commented 10 months ago

superseded by #121

arno01 commented 9 months ago

> superseded by #121

let's keep this open since this issue is different:

#121 is kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))

and this one is related to monitorMaxRetries when K8s runs out of resources to redeploy the workloads (i.e. even when one of the worker nodes falls off for a short period of time)

andy108369 commented 8 months ago

NICs died, leases died.


That's OK as long as the provider isn't charging for a lease that hasn't been working for quite some time.

So I think the Alternative proposal (client-defined) would be ideal: the timeout (the amount of time a lease stays down because it cannot redeploy while the worker node is down) could be configured by the clients themselves in their SDL (say, a deployment_grace_period field in the deployment manifest).
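As a purely hypothetical sketch of what that could look like in an SDL (deployment_grace_period is not a real SDL field; it only illustrates the shape of the proposal):

```yaml
# Hypothetical SDL excerpt: deployment_grace_period is NOT an existing field,
# it only illustrates the client-defined timeout the proposal describes.
services:
  web:
    image: nginx
    # How long the provider should keep the lease alive while the workload
    # cannot be redeployed (e.g. because its worker node is down):
    deployment_grace_period: 48h
```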

I'll close this issue in favor of the Alternative proposal.