akash-network / support

Akash Support and Issue Tracking

Add documentation for provider metalLB troubleshooting #129

Open andy108369 opened 9 months ago

andy108369 commented 9 months ago

Europlots provider v0.4.6 (akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc); the RPC node is on 0.26.1 (we have tried different RPC nodes too).

I[2023-09-28|09:42:54.027] order detected                               module=bidengine-service cmp=provider order=order/akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13002070/1/1 
I[2023-09-28|09:42:54.029] group fetched                                module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13002070/1/1         
I[2023-09-28|09:42:54.029] requesting reservation                       module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13002070/1/1          

Complete Logs

https://transfer.sh/KVMK4nhuW3/provider-not-bidding-dseq-13002055.log

Additional notes

It looks like the last time the provider created a bid was before the mainnet upgrade (0.24 => 0.26) https://www.mintscan.io/akash/tx/6CF9A4C689CB378572AE161F8F8E15059A9C08AF38882182C50B37567E330FBC

RPC looks good; the provider withdraws from its leases and detects the order request, but doesn't go any further: it neither checks whether it has enough resources to host the deployment nor submits a bid.

cmp=inventory-service fell off?

The next messages, such as reservation requested and reservation count, should come from the cmp=inventory-service component, but there are none.
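A quick hedged check (the namespace and label below are assumptions based on a standard akash-provider Helm chart install) is to grep the provider logs for those inventory-service messages:

# no matches here, while "order detected" keeps appearing, points at a stuck inventory-service
kubectl -n akash-services logs -l app=akash-provider --tail=10000 | grep -E 'reservation requested|reservation count'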

Here is the expected workflow (according to the logs), taken from the Hurricane provider:

cluster resources dump message appears only once

Also, messages such as cluster resources dump (also from the cmp=inventory-service component) normally appear in the logs frequently (every few minutes) on other clusters, but this provider logged them only once (when provider-services started).
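To see when the last such dumps were produced (same namespace/label assumptions as above):

# on a healthy provider the newest "cluster resources dump" entries are only a few minutes old
kubectl -n akash-services logs -l app=akash-provider --tail=100000 | grep 'cluster resources dump' | tail -3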

andy108369 commented 9 months ago

"Fixed" the provider bidding issue

"Fixed" since that's only a workaround, akash-provider should fail to deploy the app that can't get the leased IPs ; and it should also not get stuck because of that / self-recover

After investigating the Europlots provider, I found a deployment created about 20 hours earlier that requested 3 leased IPs but never had them assigned. This was breaking the inventory-service in akash-provider.
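On a healthy cluster every akash.network LoadBalancer service gets an external IP, so a quick way to spot such a broken deployment is to list the ones still stuck in <pending> (the same label selector is used further below in the investigation):

kubectl get svc -A -l akash.network | grep LoadBalancer | grep -w '<pending>'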

To resolve it, I closed the bid of that deployment and then had to bounce the akash-provider pod for it to start bidding again. Logs: https://transfer.sh/xmB3HiBXTG/provider-is-bidding-again-dseq-13002853.log
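For reference, a rough sketch of that workaround (the exact bid-close flags and the provider pod label are assumptions here; double-check with provider-services tx market bid close --help and against your own deployment):

# close the provider's bid for the offending order (owner/dseq can be read from the lease namespace's manifest)
provider-services tx market bid close --owner <owner-address> --dseq <dseq> --gseq 1 --oseq 1 --from <provider-key-name> --node <rpc-node> --chain-id akashnet-2

# then bounce the provider pod so the inventory-service re-initializes (namespace/label assumed from the Helm chart)
kubectl -n akash-services delete pod -l app=akash-provider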

Highlights

Logs from the investigation

root@control1:~# kubectl get events -A --sort-by='.metadata.creationTimestamp' | grep -Ev 'scan-vuln|akash-provider|operator'
NAMESPACE                                       LAST SEEN   TYPE      REASON             OBJECT                   MESSAGE
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-80-tcp    Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-80-tcp": no available IPs
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-443-tcp   Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-443-tcp": no available IPs
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-22-tcp    Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-22-tcp": no available IPs
root@control1:~# kubectl -n lease get manifest fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq
NAME                                            AGE
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   20h
root@control1:~# kubectl get manifest -A --sort-by='.metadata.creationTimestamp'   | tail -2
lease       ne6ubpns0mj0kb6l268fotikc272h8eibilpknbehsgrm   22h
lease       fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   20h   <<<<<<<<<<<< THIS one

root@control1:~# kubectl -n lease get manifest fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq -o yaml | grep -i -A5 ip
        ip: haproxy_ip_server-3
        port: 22
        proto: TCP
      - endpoint_sequence_number: 1
        external_port: 80
        global: true
--
        ip: haproxy_ip_server-3
        port: 80
        proto: TCP
      - endpoint_sequence_number: 1
        external_port: 443
        global: true
--
        ip: haproxy_ip_server-3
        port: 443
        proto: TCP
      image: ubuntu:latest
      name: app
      resources:
root@control1:~#

root@control1:~# kubectl -n fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq get all
NAME                       READY   STATUS    RESTARTS   AGE
pod/app-794bbc75dc-gq54z   1/1     Running   0          20h

NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/app              ClusterIP      10.233.19.168   <none>        80/TCP                       20h
service/app-ip-22-tcp    LoadBalancer   10.233.27.253   <pending>     22:31562/TCP                 20h
service/app-ip-443-tcp   LoadBalancer   10.233.27.174   <pending>     443:31410/TCP                20h
service/app-ip-80-tcp    LoadBalancer   10.233.40.120   <pending>     80:32591/TCP                 20h
service/app-np           NodePort       10.233.43.87    <none>        22:30649/TCP,443:30161/TCP   20h

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/app   1/1     1            1           20h

NAME                             DESIRED   CURRENT   READY   AGE
replicaset.apps/app-794bbc75dc   1         1         1       20h
root@control1:~#

root@control1:~# kubectl -n metallb-system logs metallb-controller-7f6b8b7fdd-v95lp --tail=100 | grep error
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T08:19:36Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T08:19:36Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
root@control1:~#

FWIW, the live IPAddressPool reports the 194.28.98.219-194.28.98.220 range, while the provider's metallb-config.yaml has 194.28.98.218-194.28.98.220.

194.28.98.219 has already been allocated to some other app:

root@control1:~# kubectl get svc -A -l akash.network |grep LoadBalancer | column -t                                              
9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c  app-ip-3333-tcp  LoadBalancer  10.233.18.185  194.28.98.219  3333:31042/TCP  24h  
9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c  app-ip-8585-tcp  LoadBalancer  10.233.58.174  194.28.98.219  8585:31162/TCP  24h  
fm8d7v04vkef5iqu2mpfbt07dm1ilp5930rnj2umg08k8  app-ip-1414-tcp  LoadBalancer  10.233.58.8    194.28.98.220  1414:31711/TCP  3d23h
fm8d7v04vkef5iqu2mpfbt07dm1ilp5930rnj2umg08k8  app-ip-1515-tcp  LoadBalancer  10.233.44.173  194.28.98.220  1515:32491/TCP  3d23h
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-22-tcp    LoadBalancer  10.233.27.253  <pending>      22:31562/TCP    20h  
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-443-tcp   LoadBalancer  10.233.27.174  <pending>      443:31410/TCP   20h  
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-80-tcp    LoadBalancer  10.233.40.120  <pending>      80:32591/TCP    20h  
root@control1:~#                                                                                                                 
root@control1:~# kubectl -n metallb-system get pods                                
NAME                                  READY   STATUS    RESTARTS       AGE         
metallb-controller-7f6b8b7fdd-v95lp   1/1     Running   0              28d         
metallb-speaker-25s86                 1/1     Running   1 (140d ago)   184d        
metallb-speaker-dkhb4                 1/1     Running   8 (25d ago)    184d        
metallb-speaker-r4lrx                 1/1     Running   4 (25d ago)    184d        
metallb-speaker-tm5d5                 1/1     Running   15 (25d ago)   184d        
metallb-speaker-v5vh9                 1/1     Running   2 (70d ago)    184d        
root@control1:~# kubectl -n metallb-system get svc                                 
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE 
controller                ClusterIP   10.233.14.11    <none>        7472/TCP   184d
metallb-webhook-service   ClusterIP   10.233.49.167   <none>        443/TCP    184d
root@control1:~# curl -s 10.233.14.11:7472/metrics | grep ^metal                   
metallb_allocator_addresses_in_use_total{pool="default"} 2                         
metallb_allocator_addresses_total{pool="default"} 2                                
metallb_k8s_client_config_loaded_bool 1                                            
metallb_k8s_client_config_stale_bool 0                                             
metallb_k8s_client_update_errors_total 3                                           
metallb_k8s_client_updates_total 14520                                             
root@control1:~#                                                                   
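The metrics above already tell the story: the pool has 2 addresses and 2 of them are in use, so any new IP lease is bound to fail. A hedged one-liner to compute the number of free IPs from the controller metrics (the ClusterIP is taken from the svc output above):

# free IPs = total addresses - addresses in use, per the pool metrics exposed by the MetalLB controller
curl -s 10.233.14.11:7472/metrics | awk '/^metallb_allocator_addresses_total/ {t=$2} /^metallb_allocator_addresses_in_use_total/ {u=$2} END {print t-u, "free IPs"}'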

root@control1:~# kubectl -n metallb-system get IPAddressPool default -o json | jq -r '.spec'
{                                                                                           
  "addresses": [                                                                            
    "194.28.98.219-194.28.98.220"                                                           
  ],                                                                                        
  "autoAssign": true,                                                                       
  "avoidBuggyIPs": false                                                                    
}                                                                                           

- based on the following provider/metallb-config.yaml:

root@control1:~# cat provider/metallb-config.yaml 
---                                               
apiVersion: metallb.io/v1beta1                    
kind: IPAddressPool                               
metadata:                                         
  name: default                                   
  namespace: metallb-system                       
spec:                                             
  addresses:                                      
  - 194.28.98.218-194.28.98.220                   
  autoAssign: true                                
  avoidBuggyIPs: false                            
---                                               
apiVersion: metallb.io/v1beta1                    
kind: L2Advertisement                             
metadata:                                         
  creationTimestamp: null                         
  name: l2advertisement1                          
  namespace: metallb-system                       
spec:                                             
  ipAddressPools:                                 
  - default                                       
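
Assuming the file is the intended source of truth (i.e. 194.28.98.218 really should be part of the pool), re-applying it and re-checking the live IPAddressPool should bring the two back in sync:

kubectl apply -f provider/metallb-config.yaml
kubectl -n metallb-system get IPAddressPool default -o json | jq -r '.spec.addresses'
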
andy108369 commented 8 months ago

Potentially related to the unattended upgrades previously enabled on the provider, or to the networking-related issues that have also been observed.

I'm going to turn https://github.com/akash-network/support/issues/129#issuecomment-1738943519 into a troubleshooting doc.

brewsterdrinkwater commented 8 months ago

This issue is solved. Andrey will add documentation for this.

brewsterdrinkwater commented 8 months ago

Oct 31: