akash-network / support

Akash Support and Issue Tracking

Add documentation for provider metalLB troubleshooting #129

Open andy108369 opened 9 months ago

andy108369 commented 9 months ago

Europlots provider v0.4.6 (akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc); the RPC node is on 0.26.1 (we have tried different RPC nodes too).

I[2023-09-28|09:42:54.027] order detected                               module=bidengine-service cmp=provider order=order/akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13002070/1/1 
I[2023-09-28|09:42:54.029] group fetched                                module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13002070/1/1         
I[2023-09-28|09:42:54.029] requesting reservation                       module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13002070/1/1          

Complete Logs

https://transfer.sh/KVMK4nhuW3/provider-not-bidding-dseq-13002055.log

Additional notes

It looks like the last time the provider created a bid was before the mainnet upgrade (0.24 => 0.26) https://www.mintscan.io/akash/tx/6CF9A4C689CB378572AE161F8F8E15059A9C08AF38882182C50B37567E330FBC

RPC looks good; the provider withdraws from its leases and detects the order request, but doesn't go any further: it neither checks whether it has enough resources to host the deployment nor submits a bid.

cmp=inventory-service fell off?

The next messages, such as reservation requested and reservation count, should come from the cmp=inventory-service component, but there are none.
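A quick hedged check (the namespace and label below are assumptions based on a standard akash-provider Helm chart install) is to grep the provider logs for those inventory-service messages:

# no matches here, while "order detected" keeps appearing, points at a stuck inventory-service
kubectl -n akash-services logs -l app=akash-provider --tail=10000 | grep -E 'reservation requested|reservation count'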

Here is the expected workflow (according to the logs), taken from the Hurricane provider:

cluster resources dump message appears only once

Also, messages such as cluster resources dump (also from the cmp=inventory-service component) normally appear in the logs frequently (every few minutes) on other clusters, but this provider logged them only once (when provider-services started).
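To see when the last such dumps were produced (same namespace/label assumptions as above):

# on a healthy provider the newest "cluster resources dump" entries are only a few minutes old
kubectl -n akash-services logs -l app=akash-provider --tail=100000 | grep 'cluster resources dump' | tail -3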

andy108369 commented 9 months ago

"Fixed" the provider bidding issue

"Fixed" since that's only a workaround, akash-provider should fail to deploy the app that can't get the leased IPs ; and it should also not get stuck because of that / self-recover

After investigating the Europlots provider, I found a deployment created about 20 hours earlier that requested 3 leased IPs but never had them assigned. This was breaking the inventory-service in akash-provider.
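On a healthy cluster every akash.network LoadBalancer service gets an external IP, so a quick way to spot such a broken deployment is to list the ones still stuck in <pending> (the same label selector is used further below in the investigation):

kubectl get svc -A -l akash.network | grep LoadBalancer | grep -w '<pending>'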

To resolve it, I closed the bid of that deployment and then had to bounce the akash-provider pod for it to start bidding again. Logs: https://transfer.sh/xmB3HiBXTG/provider-is-bidding-again-dseq-13002853.log
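For reference, a rough sketch of that workaround (the exact bid-close flags and the provider pod label are assumptions here; double-check with provider-services tx market bid close --help and against your own deployment):

# close the provider's bid for the offending order (owner/dseq can be read from the lease namespace's manifest)
provider-services tx market bid close --owner <owner-address> --dseq <dseq> --gseq 1 --oseq 1 --from <provider-key-name> --node <rpc-node> --chain-id akashnet-2

# then bounce the provider pod so the inventory-service re-initializes (namespace/label assumed from the Helm chart)
kubectl -n akash-services delete pod -l app=akash-provider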

Highlights

Logs from the investigation

root@control1:~# kubectl get events -A --sort-by='.metadata.creationTimestamp' | grep -Ev 'scan-vuln|akash-provider|operator'
NAMESPACE                                       LAST SEEN   TYPE      REASON             OBJECT                   MESSAGE
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-80-tcp    Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-80-tcp": no available IPs
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-443-tcp   Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-443-tcp": no available IPs
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-22-tcp    Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-22-tcp": no available IPs
root@control1:~# kubectl -n lease get manifest fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq
NAME                                            AGE
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   20h
root@control1:~# kubectl get manifest -A --sort-by='.metadata.creationTimestamp'   | tail -2
lease       ne6ubpns0mj0kb6l268fotikc272h8eibilpknbehsgrm   22h
lease       fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   20h   <<<<<<<<<<<< THIS one

root@control1:~# kubectl -n lease get manifest fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq -o yaml | grep -i -A5 ip
        ip: haproxy_ip_server-3
        port: 22
        proto: TCP
      - endpoint_sequence_number: 1
        external_port: 80
        global: true
--
        ip: haproxy_ip_server-3
        port: 80
        proto: TCP
      - endpoint_sequence_number: 1
        external_port: 443
        global: true
--
        ip: haproxy_ip_server-3
        port: 443
        proto: TCP
      image: ubuntu:latest
      name: app
      resources:
root@control1:~#

root@control1:~# kubectl -n fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq get all
NAME                       READY   STATUS    RESTARTS   AGE
pod/app-794bbc75dc-gq54z   1/1     Running   0          20h

NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/app              ClusterIP      10.233.19.168   <none>        80/TCP                       20h
service/app-ip-22-tcp    LoadBalancer   10.233.27.253   <pending>     22:31562/TCP                 20h
service/app-ip-443-tcp   LoadBalancer   10.233.27.174   <pending>     443:31410/TCP                20h
service/app-ip-80-tcp    LoadBalancer   10.233.40.120   <pending>     80:32591/TCP                 20h
service/app-np           NodePort       10.233.43.87    <none>        22:30649/TCP,443:30161/TCP   20h

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/app   1/1     1            1           20h

NAME                             DESIRED   CURRENT   READY   AGE
replicaset.apps/app-794bbc75dc   1         1         1       20h
root@control1:~#

root@control1:~# kubectl -n metallb-system logs metallb-controller-7f6b8b7fdd-v95lp --tail=100 | grep error
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T08:19:36Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T08:19:36Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
root@control1:~#

FWIW, the live IPAddressPool reports the 194.28.98.219-194.28.98.220 range, while the provider's metallb-config.yaml has 194.28.98.218-194.28.98.220.

194.28.98.219 has already been allocated to some other app:

root@control1:~# kubectl get svc -A -l akash.network |grep LoadBalancer | column -t                                              
9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c  app-ip-3333-tcp  LoadBalancer  10.233.18.185  194.28.98.219  3333:31042/TCP  24h  
9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c  app-ip-8585-tcp  LoadBalancer  10.233.58.174  194.28.98.219  8585:31162/TCP  24h  
fm8d7v04vkef5iqu2mpfbt07dm1ilp5930rnj2umg08k8  app-ip-1414-tcp  LoadBalancer  10.233.58.8    194.28.98.220  1414:31711/TCP  3d23h
fm8d7v04vkef5iqu2mpfbt07dm1ilp5930rnj2umg08k8  app-ip-1515-tcp  LoadBalancer  10.233.44.173  194.28.98.220  1515:32491/TCP  3d23h
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-22-tcp    LoadBalancer  10.233.27.253  <pending>      22:31562/TCP    20h  
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-443-tcp   LoadBalancer  10.233.27.174  <pending>      443:31410/TCP   20h  
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-80-tcp    LoadBalancer  10.233.40.120  <pending>      80:32591/TCP    20h  
root@control1:~#                                                                                                                 
root@control1:~# kubectl -n metallb-system get pods                                
NAME                                  READY   STATUS    RESTARTS       AGE         
metallb-controller-7f6b8b7fdd-v95lp   1/1     Running   0              28d         
metallb-speaker-25s86                 1/1     Running   1 (140d ago)   184d        
metallb-speaker-dkhb4                 1/1     Running   8 (25d ago)    184d        
metallb-speaker-r4lrx                 1/1     Running   4 (25d ago)    184d        
metallb-speaker-tm5d5                 1/1     Running   15 (25d ago)   184d        
metallb-speaker-v5vh9                 1/1     Running   2 (70d ago)    184d        
root@control1:~# kubectl -n metallb-system get svc                                 
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE 
controller                ClusterIP   10.233.14.11    <none>        7472/TCP   184d
metallb-webhook-service   ClusterIP   10.233.49.167   <none>        443/TCP    184d
root@control1:~# curl -s 10.233.14.11:7472/metrics | grep ^metal                   
metallb_allocator_addresses_in_use_total{pool="default"} 2                         
metallb_allocator_addresses_total{pool="default"} 2                                
metallb_k8s_client_config_loaded_bool 1                                            
metallb_k8s_client_config_stale_bool 0                                             
metallb_k8s_client_update_errors_total 3                                           
metallb_k8s_client_updates_total 14520                                             
root@control1:~#                                                                   
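The metrics above already tell the story: the pool has 2 addresses and 2 of them are in use, so any new IP lease is bound to fail. A hedged one-liner to compute the number of free IPs from the controller metrics (the ClusterIP is taken from the svc output above):

# free IPs = total addresses - addresses in use, per the pool metrics exposed by the MetalLB controller
curl -s 10.233.14.11:7472/metrics | awk '/^metallb_allocator_addresses_total/ {t=$2} /^metallb_allocator_addresses_in_use_total/ {u=$2} END {print t-u, "free IPs"}'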

root@control1:~# kubectl -n metallb-system get IPAddressPool default -o json | jq -r '.spec'
{                                                                                           
  "addresses": [                                                                            
    "194.28.98.219-194.28.98.220"                                                           
  ],                                                                                        
  "autoAssign": true,                                                                       
  "avoidBuggyIPs": false                                                                    
}                                                                                           

- based on the following provider/metallb-config.yaml:

root@control1:~# cat provider/metallb-config.yaml 
---                                               
apiVersion: metallb.io/v1beta1                    
kind: IPAddressPool                               
metadata:                                         
  name: default                                   
  namespace: metallb-system                       
spec:                                             
  addresses:                                      
  - 194.28.98.218-194.28.98.220                   
  autoAssign: true                                
  avoidBuggyIPs: false                            
---                                               
apiVersion: metallb.io/v1beta1                    
kind: L2Advertisement                             
metadata:                                         
  creationTimestamp: null                         
  name: l2advertisement1                          
  namespace: metallb-system                       
spec:                                             
  ipAddressPools:                                 
  - default                                       
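
Assuming the file is the intended source of truth (i.e. 194.28.98.218 really should be part of the pool), re-applying it and re-checking the live IPAddressPool should bring the two back in sync:

kubectl apply -f provider/metallb-config.yaml
kubectl -n metallb-system get IPAddressPool default -o json | jq -r '.spec.addresses'
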
andy108369 commented 8 months ago

Potentially related to the unattended upgrades previously enabled on the provider, or to the networking-related issues that have also been observed.

I'm going to turn https://github.com/akash-network/support/issues/129#issuecomment-1738943519 into a troubleshooting doc.

brewsterdrinkwater commented 8 months ago

This issue is solved. Andrey will add documentation for this.

brewsterdrinkwater commented 8 months ago

Oct 31: