andy108369 closed this issue 8 months ago.
It is the third time this provider is losing leases, and the cause each time is a failing k8s node (or a worker node being down).
It looks like when the cluster does not have enough room (resources) for the apps to start on the other available nodes, the monitorMaxRetries counter gets triggered, eventually leading the provider to close the deployments as described in https://github.com/akash-network/support/issues/14
13 Dec 2022
https://discord.com/channels/747885925232672829/1045829981676253285/1052000517053743135
cause: unknown (but I believe it's the same as the one below)

19 Dec 2022
https://discord.com/channels/747885925232672829/1045829981676253285/1054464080213196851
cause: a problem that occurred after a restart following a regular system update

4 Feb 2023
https://discord.com/channels/747885925232672829/1045829981676253285/1071499638043054131
cause: a cluster node experienced network errors

provider logs (after stripping off the deployment hostnames with `sed -e 's/service-name=.*/service-name=REDACTED/' -e 's/hostname=.*/hostname=REDACTED/'`) => 0-redacted.log
These logs do contain some messages such as `bid in unexpected state`, but not for the dseqs in question (the closed ones mentioned above).
They don't contain the `deployment failed. closing lease` message (with the `attempts` value accumulating up to 40; refs), because, again, they only cover the period after the incident occurred, starting at 2023-02-04|17:32:44.100, while the leases were closed between 2023-02-04T17:13:16Z and 2023-02-04T17:16:30Z.
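For reference, this is the kind of command one can use to look for those messages in the provider pod's logs; the `akash-services` namespace and the `app=akash-provider` label are assumptions and may differ per setup:

```sh
# grep the provider pod's logs for the two lease-closure related messages;
# namespace and label selector are assumptions, adjust for your cluster
kubectl -n akash-services logs -l app=akash-provider --tail=-1 --timestamps \
  | grep -E 'bid in unexpected state|deployment failed\. closing lease'
```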
I've asked the provider whether he can find older logs (`find /var/log/pods -xdev -type f -name '*.log' -ls | grep provider`).
Update: unfortunately, the provider didn't save the logs before restarting the akash-provider pod.
It's the 4th time d3akash is losing its leases (source: https://status.d3akash.cloud/status/status ).
It's OK to have the provider close leases it isn't charging for when they haven't been working for quite some time.
So I think the Alternative proposal (client-defined) would be ideal, if the timeout (the amount of time a lease is allowed to stay down because it cannot redeploy while the worker node is down) could be configured by the clients themselves in their SDL (say, a deployment_grace_period field in the deployment manifest).
I'll close this issue in favor of the Alternative proposal.
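To illustrate, the client-defined knob could look something like this in the SDL. Note that `deployment_grace_period` is a hypothetical field sketching this proposal, not part of the current SDL spec:

```yaml
# hypothetical SDL excerpt: deployment_grace_period is NOT a real SDL field today,
# it only illustrates the client-defined timeout suggested above
version: "2.0"

# how long the provider should keep the lease open while it is unable to
# redeploy the workload (e.g. because the worker node is down)
deployment_grace_period: 1h

services:
  web:
    image: nginx
    expose:
      - port: 80
        as: 80
        to:
          - global: true
```

With something like this, the provider would only close the lease after the client-chosen grace period has elapsed, instead of after a hardcoded number of redeploy attempts.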
One of the d3akash.cloud provider's K8s cluster nodes had a network error, which caused the provider to close 32 leases.
I think the root cause is https://github.com/akash-network/support/issues/14 ; waiting for the logs from the provider owner to confirm.
provider address: akash1u5cdg7k3gl43mukca4aeultuz8x2j68mgwn28e

heights 9636347 ... 9636379: MsgCloseBid issued to the 32 leases

leases before the provider closed them (height 9636346, i.e. the height right before the provider started closing the 32 leases). This is mainly to verify withdrawn vs consumed, to rule out the case where the provider could have been running without withdrawing from the leases (e.g. due to some bug / misconfig) for some time until it was restarted.

leases after the drop (height 9636379)
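For the record, this is roughly how those snapshots can be taken. It assumes an RPC node that still has state at these heights, and the `jq` paths reflect my understanding of the lease list JSON, so they may need adjusting:

```sh
# active leases of this provider right before the closures (height 9636346);
# escrow_payment.withdrawn is what has been withdrawn so far, while "consumed"
# has to be derived from the lease price and the number of blocks it was active
akash query market lease list \
  --provider akash1u5cdg7k3gl43mukca4aeultuz8x2j68mgwn28e \
  --state active --height 9636346 --limit 100 -o json \
  | jq -r '.leases[] | [.lease.lease_id.dseq, .escrow_payment.withdrawn.amount] | @tsv'

# the same query at height 9636379 (with --state closed) shows the leases after the drop
```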