akash-network / support

Akash Support and Issue Tracking

Dangling Deployments on Testnet Provider #102

Status: Closed (chainzero closed 1 year ago)

chainzero commented 1 year ago

A provider reports that GPU deployments remain active on the provider after the deployment and lease have been closed.

Affected provider version - 0.3.1-rc0

Details of an example dangling deployment:

Deployments

root@node1:~# kubectl get pods -A
NAMESPACE                                       NAME                                       READY   STATUS    RESTARTS       AGE
2n0eiafgd0jkuhjm5s6mlbktfnjvdde12abarm32vqv7g   app-794f6cd9d6-qx696                       1/1     Running   1 (89m ago)    93m
3s9v4cj92jc2i8uhhv78nvc70gigoauvolk8prs73cql8   app-578d77f468-9pn2z                       1/1     Running   1 (89m ago)
root@node1:~# kubectl get ns --show-labels
NAME                                            STATUS   AGE     LABELS
2n0eiafgd0jkuhjm5s6mlbktfnjvdde12abarm32vqv7g   Active   16h     akash.network/lease.id.dseq=26084,akash.network/lease.id.gseq=1,akash.network/lease.id.oseq=1,akash.network/lease.id.owner=akash1eedv523xngwrx7sy7dznmslp9c75qe928477et,akash.network/lease.id.provider=akash143ypn84kuf379tv9wvcxsmamhj83d5pg2rfc8v,akash.network/namespace=2n0eiafgd0jkuhjm5s6mlbktfnjvdde12abarm32vqv7g,akash.network=true,kubernetes.io/metadata.name=2n0eiafgd0jkuhjm5s6mlbktfnjvdde12abarm32vqv7g
3s9v4cj92jc2i8uhhv78nvc70gigoauvolk8prs73cql8   Active   16h     akash.network/lease.id.dseq=26062,akash.network/lease.id.gseq=1,akash.network/lease.id.oseq=1,akash.network/lease.id.owner=akash1eedv523xngwrx7sy7dznmslp9c75qe928477et,akash.network/lease.id.provider=akash143ypn84kuf379tv9wvcxsmamhj83d5pg2rfc8v,akash.network/namespace=3s9v4cj92jc2i8uhhv78nvc70gigoauvolk8prs73cql8,akash.network=true,kubernetes.io/metadata.name=3s9v4cj92jc2i8uhhv78nvc70gigoauvolk8prs73cql8
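The namespace labels above carry the full lease ID, so the on-chain query for a given namespace can be derived from them. The following is a minimal sketch, not part of the provider tooling: `label_value` and `lease_query_for_labels` are hypothetical helper names, and the input is assumed to be the comma-separated LABELS column exactly as printed by `kubectl get ns --show-labels`. The helper only prints the query, it does not run it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the value of one key from a comma-separated
# "key=value,key=value" label string (the LABELS column shown above).
label_value() {
  local labels="$1" key="$2" pair
  local IFS=','
  for pair in $labels; do
    if [ "${pair%%=*}" = "$key" ]; then
      printf '%s\n' "${pair#*=}"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: print (not run) the lease query for the lease that a
# namespace's labels point to. Fails if a required label is missing.
lease_query_for_labels() {
  local labels="$1" dseq owner provider
  dseq=$(label_value "$labels" "akash.network/lease.id.dseq") || return 1
  owner=$(label_value "$labels" "akash.network/lease.id.owner") || return 1
  provider=$(label_value "$labels" "akash.network/lease.id.provider") || return 1
  printf 'provider-services query market lease get --dseq %s --owner %s --provider %s\n' \
    "$dseq" "$owner" "$provider"
}
```

Passing the LABELS value of a namespace to `lease_query_for_labels` yields a command of the same form as the lease query used in this report.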

On Chain Lease Status of Example Dangling Deployment - Closed State

root@ip-172-31-25-67:~# provider-services query market lease get --dseq 26084 --owner akash1eedv523xngwrx7sy7dznmslp9c75qe928477et --provider akash143ypn84kuf379tv9wvcxsmamhj83d5pg2rfc8v
escrow_payment:
  account_id:
    scope: deployment
    xid: akash1eedv523xngwrx7sy7dznmslp9c75qe928477et/26084
  balance:
    amount: "0.000000000000000000"
    denom: uakt
  owner: akash143ypn84kuf379tv9wvcxsmamhj83d5pg2rfc8v
  payment_id: 1/1/akash143ypn84kuf379tv9wvcxsmamhj83d5pg2rfc8v
  rate:
    amount: "364.000000000000000000"
    denom: uakt
  state: closed
  withdrawn:
    amount: "1591772"
    denom: uakt
lease:
  closed_on: "30470"
  created_at: "26097"
  lease_id:
    dseq: "26084"
    gseq: 1
    oseq: 1
    owner: akash1eedv523xngwrx7sy7dznmslp9c75qe928477et
    provider: akash143ypn84kuf379tv9wvcxsmamhj83d5pg2rfc8v
  price:
    amount: "364.000000000000000000"
    denom: uakt
  state: closed

On Chain Deployment Status of Example Dangling Deployment - Closed State

root@ip-172-31-25-67:~# provider-services query deployment get --dseq 26084 --owner akash1eedv523xngwrx7sy7dznmslp9c75qe928477et
deployment:
  created_at: "26086"
  deployment_id:
    dseq: "26084"
    owner: akash1eedv523xngwrx7sy7dznmslp9c75qe928477et
  state: closed
  version: 3jD/Ny+7QXgWavV54P+j9mdd8X+kJKpxHx4xC9+iscI=
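Both queries above report `state: closed`, but the lease output contains two `state` fields (one under `escrow_payment`, one under `lease`), so a plain grep for `state` is ambiguous. As a rough sketch for checking several leases in a loop, the field can be read per top-level section with awk. `yaml_state` is a hypothetical helper name, and it assumes the two-space indentation seen in the output above rather than parsing YAML in general.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read provider-services YAML output on stdin and print
# the `state` value that sits directly under the given top-level key
# (for example: lease, escrow_payment, deployment).
yaml_state() {
  awk -v section="$1" '
    # A line starting at column 0 with "key:" opens a new top-level section.
    /^[A-Za-z_][^:]*:/ { current = $1; sub(/:.*$/, "", current) }
    # Only a state field indented by exactly two spaces belongs to the section.
    current == section && /^  state:/ { print $2; exit }
  '
}
```

For instance, piping the lease query output into `yaml_state lease` prints the lease state, and piping the deployment query output into `yaml_state deployment` prints the deployment state.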

When the dangling deployments script is run, it appears that the provider still had an open lease for the example deployments (a misalignment with the on-chain state), and the script closes the namespaces and related artifacts:

root@node1:~# bash +x dangling.sh
kubectl delete ns 2n0eiafgd0jkuhjm5s6mlbktfnjvdde12abarm32vqv7g --wait=false
kubectl -n lease delete providerhosts --selector=akash.network/lease.id.owner=akash1eedv523xngwrx7sy7dznmslp9c75qe928477et,akash.network/lease.id.dseq=26084,akash.network/lease.id.gseq=1,akash.network/lease.id.oseq=1 --wait=false
kubectl delete ns 3s9v4cj92jc2i8uhhv78nvc70gigoauvolk8prs73cql8 --wait=false
kubectl -n lease delete providerhosts --selector=akash.network/lease.id.owner=akash1eedv523xngwrx7sy7dznmslp9c75qe928477et,akash.network/lease.id.dseq=26062,akash.network/lease.id.gseq=1,akash.network/lease.id.oseq=1 --wait=false
Defaulted container "provider" out of: provider, init (init)

However, the deployments remain active after the cleanup. Manually deleting the related deployments in Kubernetes removes the instances (the K8s deployments and their associated pods), but we still need to determine why these deployments were left dangling and why the dangling deployments script had no effect.

andy108369 commented 1 year ago

FWIW: the kubectl delete commands must be run manually after running bash +x dangling.sh.
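Since the delete commands have to be run by hand, one possible way to avoid copy-pasting them is a small filter over the script output. This is only a sketch under assumptions: `apply_printed_deletes` is a hypothetical name, it recognizes only the two command shapes that appear in the output above, and it defaults to a dry run so nothing is deleted unless `apply` is passed explicitly.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read script output on stdin and act only on lines that
# look like the two delete commands shown in this issue. Other lines (such as
# "Defaulted container ...") are ignored.
#   mode "dry-run" (default): print what would be run
#   mode "apply":             run the command
apply_printed_deletes() {
  local mode="${1:-dry-run}" line
  while IFS= read -r line; do
    case "$line" in
      "kubectl delete ns "*|"kubectl -n lease delete providerhosts "*)
        if [ "$mode" = "apply" ]; then
          # Word-split the line instead of using eval, so shell metacharacters
          # in the input are never interpreted; set -f disables globbing.
          # shellcheck disable=SC2086
          ( set -f; set -- $line; "$@" </dev/null )
        else
          printf 'would run: %s\n' "$line"
        fi
        ;;
    esac
  done
}
```

A cautious workflow would be to pipe the script output through `apply_printed_deletes` first, review the `would run:` lines, and only then repeat with `apply_printed_deletes apply`.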

andy108369 commented 1 year ago

I haven't seen this issue on our provider, and I can't reproduce it so far... Maybe it's related to the fact that we reset the GPU testnet-02 on Monday...? Though his deployment corresponds to a block on the new chain, after the reset :man_shrugging: