Closed: andy108369 closed this issue 2 months ago
As per Maxime Cyr, he is still seeing the allocated value change on provider.yvr.nb.akash.pub even when there are no reservations. He confirmed ~15 minutes ago that the gRPC endpoint returned 8 allocated GPUs with 0 pending/active reservations.
Screenshot from him:
provider.yvr.nb.akash.pub
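For reference, a quick way to spot-check the reported allocation yourself (a sketch only, assuming curl, jq and grpcurl are installed, that gRPC reflection is enabled on 8444, and that the status JSON exposes a cluster.inventory section in this provider version; field paths and service names may differ):
$ curl -ks https://provider.yvr.nb.akash.pub:8443/status | jq '.cluster.inventory'
$ grpcurl -insecure provider.yvr.nb.akash.pub:8444 list   # list the exposed services, then describe/call the status RPC for the provider version in use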
provider.yvr.nb.akash.pub.provider.log
In the provider logs (cat provider.yvr.nb.akash.pub.provider.log | grep -E 'reservation requested|Reservation fulfilled|is not whitelisted') there are many instances where the provider fulfills the reservation but the bid price script then rejects it because the address isn't whitelisted. Could this be contributing to the issue?
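For illustration, a whitelist gate inside a bid price script usually boils down to something like the hypothetical sketch below (this is not the actual script running on this provider; the owner argument and whitelist path are placeholders):
#!/bin/bash
owner="$1"                                  # requesting tenant address (how it is passed in varies by setup)
if ! grep -qx "$owner" /config/whitelist.txt; then
  echo "$owner is not whitelisted" >&2
  exit 1                                    # non-zero exit makes the provider decline to bid
fi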
provider.yvr.nb.akash.pub.grpc.2.log
Most recent bids that went to the chain (the excerpt below is in chronological order):
$ akash query txs --events "message.sender=akash1gvk0778ytqlfamufh4h0u72lf0l4ed87prj07v&message.action=/akash.market.v1beta4.MsgCreateBid" -o json --page 1 --limit 100 | jq -r '.txs[] | [.timestamp, .txhash, (.tx | (.body.messages[] | (.resources_offer[] | (.resources.gpu.units.val, .resources.gpu.attributes[0].key))))] | @csv'
...
"2024-03-23T17:22:55Z","5E5E97613F9FDB344A174E6DA9A6300444B07AD79EDFAC2EBEA9AE74F37B0F4D","8","vendor/nvidia/model/h100"
"2024-03-25T02:43:00Z","85D3CCD2F79F66C52683505328DBA4D4F1A5310FFCDBF43005DB7B8DA57FEBA9","8","vendor/nvidia/model/h100"
"2024-03-25T03:18:30Z","326309C190983A905C123CC69BE6243D22ACCAAD6A94D50AB017B3A55E0E1F88","8","vendor/nvidia/model/h100"
"2024-03-25T13:20:37Z","1DB848338EAE25760CAB0A84273D31FCF029A35F96A0BDDE252B904E573E3ED9","8","vendor/nvidia/model/h100"
"2024-03-25T13:42:06Z","28B807343820647B85B6F82F6022225060738660468CE5956F302F0C1469AEC6","8","vendor/nvidia/model/h100"
"2024-03-25T16:40:01Z","EC35820B99B7BD6EBB99F5E5DE94DCFF5B0AD9A3BCD6CEE49027D7A3D0343B7F","8","vendor/nvidia/model/h100"
"2024-03-25T19:47:13Z","782C72A7D7F44694F3E277D9B0349D5E0F876D05CF34AACFDBCC4A6559AC2765","8","vendor/nvidia/model/h100"
"2024-03-25T20:11:51Z","29B45CDBF74C8A08AF2B22C4FD133144C580F4D3C0047166D32CEC4BF2241CBA","1","vendor/nvidia/model/h100"
Additional info:
root@sysadmin:~# kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system kube-proxy-hf8gs 1/1 Running 3 (20d ago) 22d 104.255.9.232 sysadmin <none> <none>
kube-system coredns-5d78c9869d-bw2rv 1/1 Running 3 (20d ago) 22d 10.232.50.120 sysadmin <none> <none>
kube-system coredns-5d78c9869d-9jvgj 1/1 Running 3 (20d ago) 22d 10.232.50.122 sysadmin <none> <none>
tigera-operator tigera-operator-59c686f986-c8hbw 1/1 Running 6 (20d ago) 22d 104.255.9.232 sysadmin <none> <none>
calico-system calico-node-hj5d9 1/1 Running 3 (20d ago) 22d 104.255.9.232 sysadmin <none> <none>
calico-system calico-typha-58c5886567-6ltwg 1/1 Running 3 (20d ago) 22d 104.255.9.232 sysadmin <none> <none>
calico-system calico-kube-controllers-b65f76557-hvnht 1/1 Running 3 (20d ago) 22d 10.232.50.119 sysadmin <none> <none>
calico-system csi-node-driver-gg72p 2/2 Running 6 (20d ago) 22d 10.232.50.125 sysadmin <none> <none>
calico-apiserver calico-apiserver-86dcb89c95-h2h6j 1/1 Running 3 (20d ago) 22d 10.232.50.118 sysadmin <none> <none>
calico-apiserver calico-apiserver-86dcb89c95-swsp5 1/1 Running 3 (20d ago) 22d 10.232.50.123 sysadmin <none> <none>
ingress-nginx ingress-nginx-controller-jn2rn 1/1 Running 3 (20d ago) 22d 10.232.50.121 sysadmin <none> <none>
kube-system etcd-sysadmin 1/1 Running 4 (20d ago) 21d 104.255.9.232 sysadmin <none> <none>
kube-system kube-apiserver-sysadmin 1/1 Running 4 (20d ago) 21d 104.255.9.232 sysadmin <none> <none>
kube-system kube-scheduler-sysadmin 1/1 Running 3 (20d ago) 21d 104.255.9.232 sysadmin <none> <none>
kube-system kube-controller-manager-sysadmin 1/1 Running 3 (20d ago) 21d 104.255.9.232 sysadmin <none> <none>
nvidia-device-plugin nvdp-nvidia-device-plugin-scrft 1/1 Running 0 20d 10.232.50.131 sysadmin <none> <none>
akash-services akash-node-1-0 1/1 Running 0 7d14h 10.232.50.146 sysadmin <none> <none>
akash-services operator-hostname-54674bf9d4-zmz77 1/1 Running 0 7d14h 10.232.50.148 sysadmin <none> <none>
akash-services operator-inventory-85d7f6bb6d-hvmvs 1/1 Running 0 36h 10.232.50.160 sysadmin <none> <none>
akash-services operator-inventory-hardware-discovery-sysadmin 1/1 Running 0 36h 10.232.50.161 sysadmin <none> <none>
akash-services akash-provider-0 1/1 Running 0 17h 10.232.50.164 sysadmin <none> <none>
root@sysadmin:~# helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
akash-hostname-operator akash-services 2 2024-03-18 23:36:14.431790756 +0000 UTC deployed akash-hostname-operator-9.0.5 0.5.4
akash-node akash-services 2 2024-03-18 23:18:32.575638343 +0000 UTC deployed akash-node-9.0.2 0.32.2
akash-provider akash-services 4 2024-03-25 20:07:32.884746516 +0000 UTC deployed provider-9.1.2 0.5.4
ingress-nginx ingress-nginx 1 2024-03-04 04:16:52.531639998 +0000 UTC deployed ingress-nginx-4.9.1 1.9.6
inventory-operator akash-services 2 2024-03-18 23:37:05.473748953 +0000 UTC deployed akash-inventory-operator-9.0.6 0.5.4
nvdp nvidia-device-plugin 1 2024-03-06 03:26:43.729706422 +0000 UTC deployed nvidia-device-plugin-0.14.5 0.14.5
provider.yvr.nb.akash.pub.grpc.1.log
@troian cc @baktun14
Hey, yes it still happens; it usually takes about 15-30 minutes to release the resources. It started happening after I upgraded my provider to the discovery-feature version (0.5.4?). On previous versions it was working fine. I thought it might be the public RPC node's fault, but it turns out it still happens even after I set up my own local RPC node.
Provider 0.5.12 does not exhibit the issue of excessive resource reporting, and it appears to have significantly decreased the time it locks up the resources.
I've tested a deployment (broadcast & closed it without accepting the bid); the provider locked the resources for no longer than a minute.
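For context, that test amounts to roughly the following (a sketch using the standard akash CLI; key name, RPC node, SDL file and dseq are placeholders):
$ akash tx deployment create deploy.yaml --from mykey --node "$AKASH_NODE" --chain-id akashnet-2 -y
# wait for bids but do NOT create a lease, then close the deployment:
$ akash tx deployment close --dseq <dseq> --from mykey --node "$AKASH_NODE" --chain-id akashnet-2 -y
# afterwards, watch how long the provider keeps reporting the resources as allocated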
Next steps: once 0.5.12 gets applied to most of the providers, test them to see if the issue still occurs.

Hey, so I updated to 0.5.12 and so far it looks better already. The biggest delay I waited for the values to refresh was about 5 minutes; before it was more like 30 minutes. I will keep monitoring this.
yeah, can confirm. 0.5.13 - releases GPU resources after like a minute :-) :rocket:
Provider does not immediately release the resources after a never-accepted (no lease created) deployment order request closes. I.e. the provider "locks up" the resources for the pending deployment for 15-25 minutes and does not release them even after the deployment is closed (and was never even accepted).
akash network 0.30.0, provider 0.5.4
Reproducer
8443/status info from the provider (before, attached file)
8443/status info from the provider (after, attached file)
Result/Issue
Provider doesn't release the resources (Allocatable, also in the gRPC output over 8444/tcp) after the deployment order was closed.
Impact
Because of this, providers may not always be able to immediately bid on deployment order requests, e.g. when the requested amount of resources is larger than what remains free after subtracting the "locked" resources ("locked" in quotes because they aren't technically locked once the deployment is closed). Additionally, if someone constantly broadcasts multiple deployment order requests without accepting them, say every 15 minutes, all of the providers will end up "locked".
Expected result
The provider should release the resources, as reported over the 8443/status or 8444 (gRPC) endpoints, immediately after the bid is closed (or at least within a few seconds, instead of 15-25 minutes).
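One way to watch for the release (a sketch; assumes curl and jq, and that the status JSON has a cluster.inventory section in the provider version under test; <provider-host> is a placeholder):
$ while true; do date -u; curl -ks https://<provider-host>:8443/status | jq -c '.cluster.inventory'; sleep 30; done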
Additional info
I've also noticed that sometimes the provider would immediately release the resources if it did not bid, even while the deployment order request was still open.
Clue 1 / dseq 15400544 / "unknown lease for bid" could be the cause?
mon.obl.akash.pub provider, past 6 minutes: resources are still "locked".
Deployment dseq 15400544 (11 CPU, 11Gi memory, 111Gi ephemeral disk) was created, bid upon, never accepted, and closed shortly after it was created.
Looks like this could be related to this error:
Provider logs
mon.obl.provider.log
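To pull the relevant lines out of the attached log (a sketch; the exact error wording may differ by provider version):
$ grep -iE 'unknown lease for bid|reservation' mon.obl.provider.log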