akash-network / support

Akash Support and Issue Tracking

Provider should release resources immediately after a never-accepted deployment was closed #193

Closed. andy108369 closed this issue 2 months ago.

andy108369 commented 3 months ago

The provider does not immediately release resources after a never-accepted (no lease created) deployment order request is closed. That is, the provider "locks up" the resources for the pending deployment for 15-25 minutes and does not release them even after the deployment was closed (without ever having been accepted).

Akash network 0.30.0, provider 0.5.4

Reproducer

  1. save the 8443/status info from the provider (bef file); see the sketch after this list
  2. broadcast a deployment order request
  3. wait for the provider to bid on it (the provider doesn't necessarily have to bid; it only has to notice the deployment order request. The bid pricing script may refuse to bid because the client address is not whitelisted.)
  4. close the deployment order request
  5. save the 8443/status info from the provider (aft file)
    • note: this reproducer doesn't always reproduce the issue
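
A minimal sketch of steps 1 and 5, assuming the provider serves its status over HTTPS on port 8443 as above (the hostname is a placeholder):

$ curl -sk https://provider.example.com:8443/status > bef
$ # steps 2-4: broadcast the order, wait for the provider to notice it, close it
$ curl -sk https://provider.example.com:8443/status > aft
$ diff <(jq -S . bef) <(jq -S . aft)   # any leftover reservation shows up here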

Result/Issue

The provider doesn't release the resources (Allocatable, also visible in the gRPC output over 8444/tcp) after the deployment order was closed.

(screenshot)

script used: provider_info2.sh
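
For the gRPC side on 8444/tcp, server reflection is one way to discover what the endpoint exposes (a hedged example, assuming reflection is enabled; the exact service and method names vary by provider version):

$ grpcurl -insecure provider.example.com:8444 list
$ grpcurl -insecure provider.example.com:8444 describe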

Impact

Because of this, providers may not always be able to bid on deployment order requests immediately, e.g. when the requested amount of resources is larger than what remains available once the "locked" resources are subtracted ("locked" in quotes, because they aren't technically locked once the deployment is closed). Additionally, if someone were to constantly broadcast multiple deployment order requests without accepting them, say every 15 minutes, all of the providers would end up "locked".
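
A hedged sketch of that pattern, purely illustrative (the wallet name, SDL file, and <dseq> are placeholders; the point is the broadcast-then-close loop, not something to run against live providers):

while true; do
  akash tx deployment create deploy.yaml --from wallet -y    # broadcast; never accept any bid
  sleep 60                                                   # give providers time to reserve and bid
  akash tx deployment close --dseq <dseq> --from wallet -y   # close without ever creating a lease
  sleep 840                                                  # repeat roughly every 15 minutes
done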

Expected result

The provider should release the resources, as reported over the 8443/status or 8444 (gRPC) endpoints, immediately after the bid was closed (or at least within a few seconds, instead of 15-25 minutes).
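
A minimal timing check, assuming the same 8443/status endpoint (hostname is a placeholder; hashing the whole payload is a crude proxy, since any change in the status output ends the loop):

start=$(date +%s)
before=$(curl -sk https://provider.example.com:8443/status | sha256sum)
# close the bid/deployment order here, then poll until the status payload changes
until [ "$(curl -sk https://provider.example.com:8443/status | sha256sum)" != "$before" ]; do
  sleep 5
done
echo "status payload changed after $(( $(date +%s) - start ))s"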

Additional info

I've also noticed that sometimes the provider would release the resources immediately when it had not bid, even while the deployment order request was still open.


Clue 1 / dseq 15400544 / could "unknown lease for bid" be the cause?

mon.obl.akash.pub provider

(screenshot)

Six minutes later - resources are still "locked": (screenshot)

Deployment dseq 15400544 (11 CPU, 11Gi memory, 111Gi ephemeral disk) was created, bid upon, never accepted, and closed shortly after it was created.

Looks like this could be related to this error:

Look for "unknown lease for bid":

$ grep 15400544 mon.obl.provider.log 
I[2024-03-12|17:26:08.913] order detected                               module=bidengine-service cmp=provider order=order/akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:08.920] group fetched                                module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:08.920] requesting reservation                       module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
D[2024-03-12|17:26:08.920] reservation requested. order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1, resources=[{"resource":{"id":1,"cpu":{"units":{"val":"11000"}},"memory":{"size":{"val":"11811160064"}},"storage":[{"name":"default","size":{"val":"119185342464"}}],"gpu":{"units":{"val":"0"}},"endpoints":[{"kind":1,"sequence_number":0},{"sequence_number":0}]},"count":1,"price":{"denom":"uakt","amount":"1000000.000000000000000000"}}] module=provider-cluster cmp=provider cmp=service cmp=inventory-service
I[2024-03-12|17:26:08.920] Reservation fulfilled                        module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
D[2024-03-12|17:26:10.241] submitting fulfillment                       module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1 price=9.038012000000000000uakt
I[2024-03-12|17:26:15.012] bid complete                                 module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:26.781] order closed                                 module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:26.781] shutting down                                module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
D[2024-03-12|17:26:26.781] unreserving reservation                      module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
D[2024-03-12|17:26:26.781] unreserving capacity                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:26.781] attempting to removing reservation           module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:26.781] removing reservation                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
I[2024-03-12|17:26:26.781] unreserve capacity complete                  module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
D[2024-03-12|17:26:26.782] closing bid                                  module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1 order-id=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1
E[2024-03-12|17:26:26.787] closing bid                                  module=bidengine-order cmp=provider order=akash1sq46jmdrk0usarsnz8a7ulv5ysgqzl5vqmte7m/15400544/1/1 err="rpc error: code = Unknown desc = rpc error: code = Unknown desc = failed to execute message; message index: 0: unknown lease for bid [cosmos/cosmos-sdk@v0.45.16/baseapp/baseapp.go:781] With gas wanted: '0' and gas used: '49578' : unknown request"
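
To check whether other orders hit the same "closing bid" error path, a simple scan of the same log file works:

$ grep -c 'unknown lease for bid' mon.obl.provider.log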

Provider logs

mon.obl.provider.log

andy108369 commented 3 months ago

As per Maxime Cyr, he is still seeing the allocated value change on provider.yvr.nb.akash.pub even when there are no reservations. He said he could confirm ~15m ago that the gRPC endpoint returned 8 allocated GPUs with 0 pending/active reservations.

Screenshot from him: (attached)

Logs from the provider.yvr.nb.akash.pub provider

Potential Clue 2 - reservations rejected by the bid script after being fulfilled by the bid engine:

provider.yvr.nb.akash.pub.provider.log

You can see in the provider logs, via

$ grep -E 'reservation requested|Reservation fulfilled|is not whitelisted' provider.yvr.nb.akash.pub.provider.log

that there are many instances where the provider fulfills the reservation but the bid price script then rejects the bid because the address isn't whitelisted. Might that contribute to the issue?
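
To quantify the mismatch (illustrative; simple line counts over the same log):

$ grep -c 'Reservation fulfilled' provider.yvr.nb.akash.pub.provider.log
$ grep -c 'is not whitelisted' provider.yvr.nb.akash.pub.provider.log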

provider.yvr.nb.akash.pub.grpc.2.log

Most recent bids which went to the chain (chronological order, oldest first):

$ akash query txs --events "message.sender=akash1gvk0778ytqlfamufh4h0u72lf0l4ed87prj07v&message.action=/akash.market.v1beta4.MsgCreateBid" -o json --page 1 --limit 100 | jq -r '.txs[] | [.timestamp, .txhash, (.tx | (.body.messages[] | (.resources_offer[] | (.resources.gpu.units.val, .resources.gpu.attributes[0].key))))] | @csv'
...
"2024-03-23T17:22:55Z","5E5E97613F9FDB344A174E6DA9A6300444B07AD79EDFAC2EBEA9AE74F37B0F4D","8","vendor/nvidia/model/h100"
"2024-03-25T02:43:00Z","85D3CCD2F79F66C52683505328DBA4D4F1A5310FFCDBF43005DB7B8DA57FEBA9","8","vendor/nvidia/model/h100"
"2024-03-25T03:18:30Z","326309C190983A905C123CC69BE6243D22ACCAAD6A94D50AB017B3A55E0E1F88","8","vendor/nvidia/model/h100"
"2024-03-25T13:20:37Z","1DB848338EAE25760CAB0A84273D31FCF029A35F96A0BDDE252B904E573E3ED9","8","vendor/nvidia/model/h100"
"2024-03-25T13:42:06Z","28B807343820647B85B6F82F6022225060738660468CE5956F302F0C1469AEC6","8","vendor/nvidia/model/h100"
"2024-03-25T16:40:01Z","EC35820B99B7BD6EBB99F5E5DE94DCFF5B0AD9A3BCD6CEE49027D7A3D0343B7F","8","vendor/nvidia/model/h100"
"2024-03-25T19:47:13Z","782C72A7D7F44694F3E277D9B0349D5E0F876D05CF34AACFDBCC4A6559AC2765","8","vendor/nvidia/model/h100"
"2024-03-25T20:11:51Z","29B45CDBF74C8A08AF2B22C4FD133144C580F4D3C0047166D32CEC4BF2241CBA","1","vendor/nvidia/model/h100"

Additional info:

root@sysadmin:~# kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide 
NAMESPACE              NAME                                             READY   STATUS    RESTARTS      AGE     IP              NODE       NOMINATED NODE   READINESS GATES
kube-system            kube-proxy-hf8gs                                 1/1     Running   3 (20d ago)   22d     104.255.9.232   sysadmin   <none>           <none>
kube-system            coredns-5d78c9869d-bw2rv                         1/1     Running   3 (20d ago)   22d     10.232.50.120   sysadmin   <none>           <none>
kube-system            coredns-5d78c9869d-9jvgj                         1/1     Running   3 (20d ago)   22d     10.232.50.122   sysadmin   <none>           <none>
tigera-operator        tigera-operator-59c686f986-c8hbw                 1/1     Running   6 (20d ago)   22d     104.255.9.232   sysadmin   <none>           <none>
calico-system          calico-node-hj5d9                                1/1     Running   3 (20d ago)   22d     104.255.9.232   sysadmin   <none>           <none>
calico-system          calico-typha-58c5886567-6ltwg                    1/1     Running   3 (20d ago)   22d     104.255.9.232   sysadmin   <none>           <none>
calico-system          calico-kube-controllers-b65f76557-hvnht          1/1     Running   3 (20d ago)   22d     10.232.50.119   sysadmin   <none>           <none>
calico-system          csi-node-driver-gg72p                            2/2     Running   6 (20d ago)   22d     10.232.50.125   sysadmin   <none>           <none>
calico-apiserver       calico-apiserver-86dcb89c95-h2h6j                1/1     Running   3 (20d ago)   22d     10.232.50.118   sysadmin   <none>           <none>
calico-apiserver       calico-apiserver-86dcb89c95-swsp5                1/1     Running   3 (20d ago)   22d     10.232.50.123   sysadmin   <none>           <none>
ingress-nginx          ingress-nginx-controller-jn2rn                   1/1     Running   3 (20d ago)   22d     10.232.50.121   sysadmin   <none>           <none>
kube-system            etcd-sysadmin                                    1/1     Running   4 (20d ago)   21d     104.255.9.232   sysadmin   <none>           <none>
kube-system            kube-apiserver-sysadmin                          1/1     Running   4 (20d ago)   21d     104.255.9.232   sysadmin   <none>           <none>
kube-system            kube-scheduler-sysadmin                          1/1     Running   3 (20d ago)   21d     104.255.9.232   sysadmin   <none>           <none>
kube-system            kube-controller-manager-sysadmin                 1/1     Running   3 (20d ago)   21d     104.255.9.232   sysadmin   <none>           <none>
nvidia-device-plugin   nvdp-nvidia-device-plugin-scrft                  1/1     Running   0             20d     10.232.50.131   sysadmin   <none>           <none>
akash-services         akash-node-1-0                                   1/1     Running   0             7d14h   10.232.50.146   sysadmin   <none>           <none>
akash-services         operator-hostname-54674bf9d4-zmz77               1/1     Running   0             7d14h   10.232.50.148   sysadmin   <none>           <none>
akash-services         operator-inventory-85d7f6bb6d-hvmvs              1/1     Running   0             36h     10.232.50.160   sysadmin   <none>           <none>
akash-services         operator-inventory-hardware-discovery-sysadmin   1/1     Running   0             36h     10.232.50.161   sysadmin   <none>           <none>
akash-services         akash-provider-0                                 1/1     Running   0             17h     10.232.50.164   sysadmin   <none>           <none>

root@sysadmin:~# helm list -A
NAME                    NAMESPACE               REVISION    UPDATED                                 STATUS      CHART                           APP VERSION
akash-hostname-operator akash-services          2           2024-03-18 23:36:14.431790756 +0000 UTC deployed    akash-hostname-operator-9.0.5   0.5.4      
akash-node              akash-services          2           2024-03-18 23:18:32.575638343 +0000 UTC deployed    akash-node-9.0.2                0.32.2     
akash-provider          akash-services          4           2024-03-25 20:07:32.884746516 +0000 UTC deployed    provider-9.1.2                  0.5.4      
ingress-nginx           ingress-nginx           1           2024-03-04 04:16:52.531639998 +0000 UTC deployed    ingress-nginx-4.9.1             1.9.6      
inventory-operator      akash-services          2           2024-03-18 23:37:05.473748953 +0000 UTC deployed    akash-inventory-operator-9.0.6  0.5.4      
nvdp                    nvidia-device-plugin    1           2024-03-06 03:26:43.729706422 +0000 UTC deployed    nvidia-device-plugin-0.14.5     0.14.5     

provider.yvr.nb.akash.pub.grpc.1.log

@troian cc @baktun14

TormenTeDx commented 3 months ago

Hey, yes it still happens; it usually takes about 15-30 minutes to release the resources. It started happening after I upgraded my provider to the discovery-feature version (0.5.4?). On previous versions it was working fine. I thought it might be the public RPC node's fault, but it turns out that even after I set up my own local RPC node it's still happening.

andy108369 commented 2 months ago

Provider 0.5.12 does not exhibit the issue of excessive resource reporting, and it appears to have significantly decreased the time it locks up the resources. I've tested a deployment (broadcast & closed it without accepting the bid); the provider locked the resources for no longer than a minute.

Next steps:

TormenTeDx commented 2 months ago

Hey, so I updated to 0.5.12 and so far it looks better already. The biggest delay I've waited for the values to refresh was about 5 minutes; before, it was more like 30 minutes. I will keep monitoring this.
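
One way to keep an eye on it (hostname is a placeholder, and the jq path is illustrative; it depends on the provider version's status schema):

$ watch -n 10 "curl -sk https://provider.example.com:8443/status | jq '.cluster.inventory'"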

andy108369 commented 2 months ago

Yeah, I can confirm. 0.5.13 releases GPU resources after about a minute :-) :rocket: