Open andy108369 opened 1 year ago
interesting.. there are absolutely no resources related to the uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche namespace (checked all CRDs and k8s resources):
akash@akash0001:~$ provider-services show-cluster-ns --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --owner akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst
uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche
akash@akash0001:~$ kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -A | grep -iE 'uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche|9562948|akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst'
and yet, the provider keeps on reporting every few seconds:
CRD manifest not found cmp=provider client=kube lease-ns=uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche
(have bounced the hostname-operator too)
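For a more targeted check, the namespace and the lease manifest object can also be queried directly. A minimal sketch; the manifests.akash.network resource name is my assumption about how the provider stores lease manifests, so confirm the actual CRD names first:
kubectl get crd | grep akash.network   # confirm which akash CRDs are actually installed
kubectl get ns uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche --ignore-not-found
kubectl get manifests.akash.network -A --ignore-not-found | grep uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche   # assumed resource name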
oh, ignore it. this is likely just the owner trying to monitor it, e.g.
provider-services lease-status --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --from akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst
I've tried running that using my key:
provider-services lease-status --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --from default
and it produced this line, even though I don't have any deployments on that provider:
I[2023-02-07|14:48:48.452] CRD manifest not found cmp=provider client=kube lease-ns=i7t9u9ie3mnj06o9ovcsoq5fubpuj5k1mmfqr9p1ts9ga
ns (namespace) can be derived manually:
# provider-services show-cluster-ns --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562949 --owner akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h
i7t9u9ie3mnj06o9ovcsoq5fubpuj5k1mmfqr9p1ts9ga
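Once the namespace is derived, its presence (or absence) on the cluster can be confirmed directly, e.g.:
kubectl get ns i7t9u9ie3mnj06o9ovcsoq5fubpuj5k1mmfqr9p1ts9ga --ignore-not-found   # empty output means nothing was ever deployed under that lease namespace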
I've tmate-ed to his provider, tried deploying my app there (dseq 9677551), and restarted akash-provider after that => no issues. :man_shrugging:
akash-provider appeared to encounter intermittent communication issues with akash-node (via RPC), leading to errors like couldn't check lease balance. retrying in 1m and lease query failed ... err=(MISSING). These errors initiated a deployment shutdown cascade, causing the akash-provider to take down multiple deployments (44 in total).
This issue recurred on the Hurricane provider, running akash-provider 0.6.5-rc6.
A total of 44 deployments (worth noting - not all of them) were closed simultaneously. For detailed analysis, let's focus on dseq 18728134.
Error: lease query failed followed by err=(MISSING)
Deployment was removed at 2024-11-25|01:05:55.661:
2024-11-25 02:05:55.720 {"log":"2024-11-25T01:05:55.703] shutting down module=provider-cluster cmp=deployment-manager lease=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/18728134/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"}
2024-11-25 02:05:55.661 {"log":"2024-11-25T01:05:55.661] lease query failed module=provider-cluster cmp=deployment-manager err=(MISSING)"}
More logs for this lease: Explore-logs-2024-11-25 11_21_30.txt
NOTE: In the initial lines of the logs (provider-services running in provider mode), the message using in cluster kube config was noted consistently, as expected after the akash-provider pod restart.
Third Provider Pod Restart: Triggered by an account sequence mismatch error.
Two Other Restarts: Reason unknown, occurred approximately 40 minutes prior to the third restart.
Error Leading to Deployments Shutdown:
couldn't check lease balance. retrying in 1m module=provider cmp=balance-checker leaseId=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/18084950/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
Subsequent logs indicate failed attempts to query open orders and the deployment shutdown cascade:
2024-11-25 01:05:55.703 finding existing orders err="post failed: Post \"http://akash-node-1:26657\": EOF"
2024-11-25 01:05:55.703 shutting down module=provider cmp=balance-checker
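The EOF on http://akash-node-1:26657 suggests checking the RPC endpoint directly. A minimal check, assuming the akash-node-1 service name from the log above and the standard CometBFT/Tendermint /status endpoint, run from a host or pod that can reach that service:
curl -s http://akash-node-1:26657/status | jq '.result.sync_info'   # catching_up=true or no response would point at the RPC node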
Impact:
First Provider Pod Restart (leading to deployment removals):
Second Provider Pod Restart (no issues observed, such as deployments being closed):
The root cause appears to stem from the cascading effects of lease query failed, leaving the provider with missing lease information.
Further investigation is required to determine:
- The reason for the err=(MISSING) in lease queries. (My guess, based on the logs: the reason is likely the intermittent communication issues of akash-provider with akash-node (via RPC).)
- The root cause of the unexpected pod restarts ~40 minutes before the critical shutdown. (This could potentially be caused by intermittent issues with the RPC pod, where the health check would trigger a provider pod restart; see the example checks below.)
I can confirm my earlier assumption: there was indeed an intermittent issue with the RPC. Upon searching for relevant error messages, I found RPC node sync check failed in the provider logs.
@troian The Akash Provider should not assume deployments are absent simply due to RPC query failures or issues with the RPC itself. It should be designed to handle intermittent communication or RPC disruptions more resiliently without removing the existing deployments from K8s.
praetor-based provider-services: 0.1.0
provider address: akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d
A provider owner (SGC#3172) reported that the j0asgbmq1a6p4s7ii0tlvuoco.ingress.dcnorse.ddns.net ingress host resource (for the DSEQ 9562948) started to return 404 and other deployments disappeared from his k8s cluster. What's interesting is that I can see providerleasedips CRD does not exist messages in the provider logs, which is part of the praetor-based provider starting script check (you can see it below). This means there is some condition under which the K8s cluster does not appear to be fully initialized, causing the provider to think it has no leases (lease query failed ... err=(MISSING)), leading to the lease removal; though, it does not close them on the blockchain (bid, lease, order, and deployment stay in the active/open state).
leases are still active from the blockchain point of view
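The starting script's CRD presence check can be reproduced manually; the exact CRD name below is an assumption based on the providerleasedips message, so list the installed akash CRDs first to confirm:
kubectl get crd | grep akash.network
kubectl get crd providerleasedips.akash.network   # assumed name; returns NotFound if the CRD is missing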
Provider logs
See more complete logs for the past 7 days here (90Mi) => https://transfer.sh/Fg7vTc/logs.txt
After about 3 hours from the provider start, the following lines started to appear in the logs:
grep MISSING logs.txt
Additional information