akash-network / support

Akash Support and Issue Tracking

Akash Provider incorrectly removes deployments during RPC communication issues; improve resilience to intermittent RPC failures #17

Open andy108369 opened 1 year ago

andy108369 commented 1 year ago

praetor-based provider-services: 0.1.0
provider address: akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d

A provider owner (SGC#3172) reported that the j0asgbmq1a6p4s7ii0tlvuoco.ingress.dcnorse.ddns.net ingress host (for DSEQ 9562948) started returning 404 and that other deployments had disappeared from his K8s cluster.

What's interesting is that the provider logs contain providerleasedips CRD does not exist messages, which come from the check in the praetor-based provider start script (shown below). This suggests there is some condition under which the K8s cluster does not appear to be fully initialized, causing the provider to think it has no leases (lease query failed ... err=(MISSING)) and to remove them from the cluster; it does not, however, close them on the blockchain (bid, lease, order, and deployment stay in the active/open state).

Leases are still active from the blockchain's point of view.

The lease, order, bid, and deployment for dseq 9562948 are all open/active:

$ export PROVIDER=akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d; HEIGHT=$(akash query block | jq -r '.block.header.height'); provider-services query market lease list --height $HEIGHT --provider $PROVIDER --gseq 0 --oseq 0 --page 1 --limit 10000 --state active -o json | jq --argjson h $HEIGHT -r '["owner","dseq/gseq/oseq","rate","monthly","withdrawn","consumed","days"], (.leases[] | [(.lease.lease_id | .owner, (.dseq|tostring) + "/" + (.gseq|tostring) + "/" + (.oseq|tostring)), (.escrow_payment | (.rate.amount|tonumber), (.rate.amount|tonumber), (.withdrawn.amount|tonumber)), (($h-(.lease.created_at|tonumber))*(.escrow_payment.rate.amount|tonumber)/pow(10;6)), (($h-(.lease.created_at|tonumber))/((60/6)*60*24))]) | @csv' | awk -F ',' '{if (NR==1) {$1=$1; printf $0"\n"} else {block_time=6; rate_akt=(($4*((60/block_time)*60*24*30.436875))/10^6); $4=rate_akt; withdrawn_akt=($5/10^6); $5=withdrawn_akt; $6; $7; print $0}}' | column -t ; TZ=UTC date
"owner"                                         "dseq/gseq/oseq"  "rate"            "monthly"  "withdrawn"  "consumed"           "days"
"akash1gx5cmx7c8tm7vw0mlgq2gk5mj0d85lm6p5cf46"  "9574619/1/1"     1.2406            0.543744   0.102863     0.126342704          7.072222222222222
"akash1a2hukwlndnz7svs08z290dm67rjmtuhkq7tq5g"  "9534122/1/1"     3.31814697265625  1.45431    0.043895     0.4722652223242187   9.883888888888889
"akash17fqxak4kprh2rlatjlv9w04s9ugl7mn32ckut3"  "9610985/1/1"     6.3312            2.77491    0.294692     0.41452265759999996  4.5467361111111115
"akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst"  "9562948/1/1"     6.3312            2.77491    0.598817     0.718654512          7.882638888888889

Provider logs

See more complete logs for the past 7 days here (90Mi) => https://transfer.sh/Fg7vTc/logs.txt

About 3 hours after the provider started, the following lines began to appear in the logs:

I[2023-02-07|10:03:45.939] using kube config file                       cmp=provider path=/home/akash/.kube/config
I[2023-02-07|10:03:48.058] using kube config file                       cmp=provider path=/home/akash/.kube/config
D[2023-02-07|10:03:48.078] service being found via autodetection        cmp=provider service=hostname-operator
D[2023-02-07|10:03:48.090] satisfying pending requests                  cmp=provider cmp=service-discovery-agent qty=1
...
...
I[2023-02-07|13:11:51.697] CRD manifest not found                       cmp=provider client=kube lease-ns=uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche
I[2023-02-07|13:11:52.797] CRD manifest not found                       cmp=provider client=kube lease-ns=uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche
...

Additional information

$ kubectl get crd |grep akash
providerhosts.akash.network       2022-11-08T10:23:32Z
manifests.akash.network           2022-11-08T10:23:32Z
providerleasedips.akash.network   2022-11-18T19:05:32Z
$ kubectl get ing -A
NAMESPACE        NAME                      CLASS                 HOSTS                               ADDRESS     PORTS   AGE
akash-services   akash-hostname-operator   akash-ingress-class   akash-hostname-operator.localhost   localhost   80      91d
# provider-services show-cluster-ns --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --owner akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst
uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche

# kubectl -n uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche get all -o wide
No resources found in uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche namespace.
$ for i in manifests providerhosts providerleasedips; do echo === $i ===; kubectl get $i -A; done
=== manifests ===
No resources found
=== providerhosts ===
No resources found
=== providerleasedips ===
No resources found
$ kubectl get nodes -o wide --show-labels
NAME    STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME         LABELS
node1   Ready    control-plane,master   91d   v1.25.3+k3s1   10.0.1.2      <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1   akash.network/role=ingress,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=k3s,beta.kubernetes.io/os=linux,egress.k3s.io/cluster=true,ingress-ready=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=k3s
$ kubectl describe nodes
Name:               node1
Roles:              control-plane,master
Labels:             akash.network/role=ingress
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    ingress-ready=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=node1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"0e:bf:9f:21:a8:32"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.0.1.2
                    k3s.io/hostname: node1
                    k3s.io/internal-ip: 10.0.1.2
                    k3s.io/node-args: ["server","--node-name","node1","--disable","traefik","--write-kubeconfig-mode","644"]
                    k3s.io/node-config-hash: DPLMREDACTED
                    k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/2ef87ff954adbb390309ce4dc07500f29c319f84feec1719bfb5059c8808ec6a"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 08 Nov 2022 10:22:21 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  node1
  AcquireTime:     <unset>
  RenewTime:       Tue, 07 Feb 2023 13:58:42 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 07 Feb 2023 13:58:10 +0000   Tue, 20 Dec 2022 15:19:44 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 07 Feb 2023 13:58:10 +0000   Tue, 20 Dec 2022 15:19:44 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 07 Feb 2023 13:58:10 +0000   Tue, 20 Dec 2022 15:19:44 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 07 Feb 2023 13:58:10 +0000   Tue, 20 Dec 2022 15:19:44 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.0.1.2
  Hostname:    node1
Capacity:
  cpu:                128
  ephemeral-storage:  559269800Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65920684Ki
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  544057661014
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65920684Ki
  pods:               110
System Info:
  Machine ID:                 95c3f7828aed407aae0d5e325974c7b2
  System UUID:                95c3f782-8aed-407a-ae0d-5e325974c7b2
  Boot ID:                    131d6385-3504-4933-afe2-0b77d98a0fa2
  Kernel Version:             5.15.0-58-generic
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.8-k3s1
  Kubelet Version:            v1.25.3+k3s1
  Kube-Proxy Version:         v1.25.3+k3s1
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://node1
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                         ------------  ----------  ---------------  -------------  ---
  akash-services              akash-hostname-operator-6b774d7b45-8c8lx     0 (0%)        0 (0%)      0 (0%)           0 (0%)         80d
  kube-system                 local-path-provisioner-5b5579c644-7rfp2      0 (0%)        0 (0%)      0 (0%)           0 (0%)         91d
  kube-system                 coredns-75fc8f8fff-thqqn                     100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     91d
  kube-system                 metrics-server-5c8978b444-7wd4g              100m (0%)     0 (0%)      70Mi (0%)        0 (0%)         91d
  akash-services              akash-node-1-0                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         15d
  ingress-nginx               ingress-nginx-controller-6898f4d494-4gs4k    100m (0%)     0 (0%)      90Mi (0%)        0 (0%)         80d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                300m (0%)   0 (0%)
  memory             230Mi (0%)  170Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>
$ ps -ef|grep -E 'akash|provider'
akash       4053    3780  1 10:03 ?        00:02:40 /metrics-server --cert-dir=/tmp --secure-port=10250 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --kubelet-use-node-status-port --metric-resolution=15s
root        4223    4002  0 10:03 ?        00:00:00 /usr/bin/tini -- provider-services hostname-operator
root        4255    4223  0 10:03 ?        00:00:05 provider-services hostname-operator
root        9198    3671 79 10:03 ?        02:56:12 /bin/akash start
akash       9371       1  0 10:03 ?        00:00:00 bash /home/akash/.praetor/start-provider.sh
akash       9444    9371  1 10:03 ?        00:02:45 /home/akash/bin/provider-services run --home /home/akash/.akash --chain-id akashnet-2 --withdrawal-period 1h30m0s --minimum-balance 500000 --keyring-backend=file --from akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --gas auto --gas-prices 0.025uakt --gas-adjustment 1.30 --kubeconfig /home/akash/.kube/config --cluster-k8s true --deployment-ingress-domain ingress.dcnorse.ddns.net --deployment-ingress-static-hosts true --deployment-runtime-class none --bid-price-strategy scale --bid-price-cpu-scale 0.01 --bid-price-memory-scale 0.0025 --bid-price-storage-scale 0.0001 --bid-deposit 5000000uakt --cluster-node-port-quantity 1000 --cluster-public-hostname praetor.ingress.dcnorse.ddns.net
root       13260    2190  0 13:06 ?        00:00:00 sshd: akash [priv]
akash      13263       1  0 13:06 ?        00:00:00 /lib/systemd/systemd --user
akash      13264   13263  0 13:06 ?        00:00:00 (sd-pam)
akash      13366   13260  0 13:06 ?        00:00:03 sshd: akash@pts/0
akash      13367   13366  0 13:06 pts/0    00:00:00 -bash
akash      14453   13367  0 13:46 pts/0    00:00:00 ps -ef
akash      14454   13367  0 13:46 pts/0    00:00:00 grep --color=auto -E akash|provider
$ /home/akash/bin/provider-services version
v0.1.0
$ cat /etc/systemd/system/akash-provider.service
[Unit]
Description=Akash Provider
After=network.target

[Service]
User=akash
ExecStart=/home/akash/.praetor/start-provider.sh
KillSignal=SIGINT
Restart=always
RestartSec=15
StartLimitInterval=200
StartLimitBurst=10
#LimitNOFILE=65535

[Install]
WantedBy=multi-user.target
akash@akash0001:~$ cat /home/akash/.praetor/start-provider.sh
#!/usr/bin/env bash

cd ~/.praetor

#maps ip of local RPC node
export AKASH_NODE="http://$(kubectl -n akash-services get ep akash-node-1 -o jsonpath='{.subsets[0].addresses[0].ip}'):26657"

# Check if the providerleasedips CRD exists
if kubectl get crd | grep -q providerleasedips; then
  # Command to execute if the CRD exists
  (
    sleep 2s
    cat key-pass.txt
    cat key-pass.txt
  ) | ~/bin/provider-services run \
    --home ~/.akash \
    --chain-id akashnet-2 \
    --withdrawal-period 1h30m0s \
    --minimum-balance 500000 \
    --keyring-backend=file \
    --from akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d \
    --gas auto \
    --gas-prices 0.025uakt \
    --gas-adjustment 1.30 \
    --kubeconfig ~/.kube/config \
    --cluster-k8s true \
    --deployment-ingress-domain ingress.dcnorse.ddns.net \
    --deployment-ingress-static-hosts true \
    --deployment-runtime-class none \
    --bid-price-strategy scale \
    --bid-price-cpu-scale 0.01 \
    --bid-price-memory-scale 0.0025 \
    --bid-price-storage-scale 0.0001 \
    --bid-deposit 5000000uakt \
    --cluster-node-port-quantity 1000 \
    --cluster-public-hostname praetor.ingress.dcnorse.ddns.net 
else
  echo "providerleasedips CRD does not exist."
fi
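
One guard the start script could additionally include is waiting for the RPC endpoint to be reachable and fully caught up before launching the provider, rather than only checking for the providerleasedips CRD. A minimal sketch, assuming AKASH_NODE points at a Tendermint RPC endpoint that exposes /status:

# wait until the RPC node answers and reports catching_up=false
until curl -sf --max-time 5 "$AKASH_NODE/status" | jq -e '.result.sync_info.catching_up == false' >/dev/null; do
  echo "RPC node at $AKASH_NODE unreachable or still catching up; retrying in 10s..."
  sleep 10
done
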
andy108369 commented 1 year ago

Interesting: there are absolutely no resources related to the uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche namespace (checked all CRDs and K8s resources):

akash@akash0001:~$ provider-services show-cluster-ns --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --owner akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst
uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche
akash@akash0001:~$ kubectl api-resources --verbs=list --namespaced -o name   | xargs -n 1 kubectl get --show-kind --ignore-not-found -A | grep -iE 'uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche|9562948|akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst'

and yet, the provider keeps on reporting every few seconds:

CRD manifest not found                       cmp=provider client=kube lease-ns=uvtrenbt6bch737mf10geqjbl3dmhb85pml2pkugogche

(have bounced the hostname-operator too)

andy108369 commented 1 year ago

Oh, ignore it. This is likely just the owner trying to monitor it, e.g.:

provider-services lease-status --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --from akash17582ja9fw6k0m0gf9tm9g64w2ks6c7aqdntyst

I've tried running that using my key:

provider-services lease-status --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562948 --from default

and it produced this line, even though I don't have any deployments on that provider:

I[2023-02-07|14:48:48.452] CRD manifest not found                       cmp=provider client=kube lease-ns=i7t9u9ie3mnj06o9ovcsoq5fubpuj5k1mmfqr9p1ts9ga

ns (namespace) can be derived manually:

# provider-services show-cluster-ns --provider akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d --dseq 9562949 --owner akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h
i7t9u9ie3mnj06o9ovcsoq5fubpuj5k1mmfqr9p1ts9ga
andy108369 commented 1 year ago

I've tmate-ed into his provider, tried deploying my app there (dseq 9677551), and restarted akash-provider after that => no issues. :man_shrugging:

andy108369 commented 1 day ago

TL;DR

akash-provider appeared to encounter intermittent communication issues with akash-node (via RPC), leading to errors like "couldn't check lease balance. retrying in 1m" and "lease query failed ... err=(MISSING)". These errors initiated a deployment shutdown cascade, causing the akash-provider to take down multiple deployments (44 in total).

Issue Overview:

This issue reoccurred on the Hurricane provider, running akash-provider 0.6.5-rc6.

A total of 44 deployments were closed simultaneously (worth noting: not all of the provider's deployments). For detailed analysis, let's focus on dseq 18728134.

Relevant Logs:
  1. Error: lease query failed followed by err=(MISSING)

  2. Deployment was removed at 2024-11-25|01:05:55.661:

    2024-11-25 02:05:55.720 {"log":"2024-11-25T01:05:55.703] shutting down module=provider-cluster cmp=deployment-manager lease=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/18728134/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"}
    2024-11-25 02:05:55.661 {"log":"2024-11-25T01:05:55.661] lease query failed module=provider-cluster cmp=deployment-manager err=(MISSING)"}

    More logs for this lease: Explore-logs-2024-11-25 11_21_30.txt
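
To pull the shutdown-cascade signature out of the provider logs around the incident window, a filter along these lines can be used (a sketch; the pod name akash-provider-0 and the akash-services namespace are assumptions, and the exported log file can be grepped the same way):

kubectl -n akash-services logs akash-provider-0 --since=24h \
  | grep -E 'lease query failed|check lease balance|shutting down' \
  | grep 18728134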


Pod Restart Analysis:

NOTE: In the initial log lines after the akash-provider pod restart (provider-services running in provider mode), the message using in cluster kube config appeared consistently, as expected.


Error Context:


Logs by Restart Timing:

  1. First Provider Pod Restart (leading to deployment removals):

  2. Second Provider Pod Restart (no issues such as deployments being closed observed):


Preliminary Conclusion:

The root cause appears to stem from the cascading effects of:

  1. lease query failed: Leading to missing lease information.
  2. Balance Checker Errors: Resulting in unfulfilled balance checks and retries.
  3. Deployments Shutdown Cascade: Akash Provider erroneously removed Akash deployments (from K8s) without issuing lease close transactions on the Akash Blockchain.

Further investigation is required to determine:

andy108369 commented 1 day ago

Further investigation is required to determine:

  • The reason for the err=(MISSING) in lease queries. (My guess, based on the logs, is that the likely cause is intermittent communication issues between akash-provider and akash-node via RPC.)
  • The root cause of the unexpected pod restarts ~40 minutes before the critical shutdown. (This could potentially be caused by intermittent issues with the RPC pod, where a failing health check triggers a provider pod restart.)

I can confirm my earlier assumption: there was indeed an intermittent issue with the RPC. Upon searching for relevant error messages, I found RPC node sync check failed messages in the provider logs.

@troian The Akash Provider should not assume deployments are absent simply due to RPC query failures or issues with the RPC itself. It should be designed to handle intermittent communication or RPC disruptions more resiliently without removing the existing deployments from K8s.
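
Until the provider handles this more gracefully, an operator-side reconciliation check can at least surface the symptom early: leases that are still active on chain but whose namespaces are gone from the cluster. A rough sketch using only the commands shown above (flags and jq paths may need adjusting):

PROVIDER=akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d
provider-services query market lease list --provider "$PROVIDER" --gseq 0 --oseq 0 --state active --limit 10000 -o json \
  | jq -r '.leases[].lease.lease_id | "\(.owner) \(.dseq)"' \
  | while read -r owner dseq; do
      # derive the K8s namespace for this lease and flag it if the namespace is missing
      ns=$(provider-services show-cluster-ns --provider "$PROVIDER" --owner "$owner" --dseq "$dseq")
      kubectl get ns "$ns" >/dev/null 2>&1 || echo "on-chain lease with no namespace: dseq=$dseq owner=$owner ns=$ns"
    done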