akash-network / support

Akash Support and Issue Tracking

BUG: akash provider lease-shell stops working when pod gets restarted due to eviction #42

Closed: arno01 closed this issue 1 year ago

arno01 commented 2 years ago

Reproducer

  1. deploy something with a 1 GB or 10 GB storage request (the request is also the limit); see the SDL sketch after this list;

  2. consume more than the limit in step 1;

    root@ssh-6fd4f4bdf9-7b2p2:/# dd if=/dev/zero of=/test-count-2048 bs=10M count=2048
    2048+0 records in
    2048+0 records out
    21474836480 bytes (21 GB, 20 GiB) copied, 13.8965 s, 1.5 GB/s
    root@ssh-6fd4f4bdf9-7b2p2:/# Error: lease shell failed: remote process exited with code 137
  3. this will cause the pod to restart due to:

    "reason": "Evicted",
    "note": "Container ssh exceeded its local ephemeral storage limit \"10737418240\". ",

    See the entire akash provider lease-events log below.

  4. akash provider lease-shell stops working;

    $ akash provider lease-shell --tty --stdin -- ssh bash
    Error: lease shell failed: remote command execute error: service with that name is not running: the service has failed
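
For reference, a minimal SDL along the lines below reproduces the setup in step 1. The image and exposed port match this deployment; the CPU, memory, and pricing values are illustrative, since the original SDL is not attached to this issue.

---
version: "2.0"

services:
  ssh:
    image: ubuntu:21.10
    expose:
      - port: 22
        as: 22
        to:
          - global: true

profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 512Mi
        # Ephemeral storage: the requested size is also enforced as the limit,
        # so writing past 10Gi triggers the eviction seen in step 3.
        storage:
          size: 10Gi
  placement:
    akash:
      pricing:
        ssh:
          denom: uakt
          amount: 1000

deployment:
  ssh:
    akash:
      profile: ssh
      count: 1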

version 0.16.4-rc0

Both the akash provider and the client are version 0.16.4-rc0.

$ curl -sk "https://provider.europlots.com:8443/version" | jq -r
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.5",
    "gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
    "gitTreeState": "clean",
    "buildDate": "2022-03-16T15:52:18Z",
    "goVersion": "go1.17.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-status after the eviction

$ akash provider lease-status
{
  "services": {
    "ssh": {
      "name": "ssh",
      "available": 1,
      "total": 1,
      "uris": [
        "31ai266lqddovfbslrlj1vtcfk.ingress.europlots.com"
      ],
      "observed_generation": 1,
      "replicas": 1,
      "updated_replicas": 1,
      "ready_replicas": 1,
      "available_replicas": 1
    }
  },
  "forwarded_ports": {
    "ssh": [
      {
        "host": "ingress.europlots.com",
        "port": 22,
        "externalPort": 32459,
        "proto": "TCP",
        "available": 1,
        "name": "ssh"
      }
    ]
  }
}

lease-events logs

$ akash provider lease-events 
{
  "type": "Normal",
  "reason": "Sync",
  "note": "Scheduled for sync",
  "object": {
    "kind": "Ingress",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "31ai266lqddovfbslrlj1vtcfk.ingress.europlots.com"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Scheduled",
  "note": "Successfully assigned vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg/ssh-6fd4f4bdf9-7b2p2 to node2",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Pulled",
  "note": "Container image \"ubuntu:21.10\" already present on machine",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Created",
  "note": "Created container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Started",
  "note": "Started container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Warning",
  "reason": "Evicted",
  "note": "Container ssh exceeded its local ephemeral storage limit \"10737418240\". ",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Killing",
  "note": "Stopping container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Scheduled",
  "note": "Successfully assigned vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg/ssh-6fd4f4bdf9-fwn5g to node2",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Pulled",
  "note": "Container image \"ubuntu:21.10\" already present on machine",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Created",
  "note": "Created container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Started",
  "note": "Started container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "SuccessfulCreate",
  "note": "Created pod: ssh-6fd4f4bdf9-7b2p2",
  "object": {
    "kind": "ReplicaSet",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "SuccessfulCreate",
  "note": "Created pod: ssh-6fd4f4bdf9-fwn5g",
  "object": {
    "kind": "ReplicaSet",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "ScalingReplicaSet",
  "note": "Scaled up replica set ssh-6fd4f4bdf9 to 1",
  "object": {
    "kind": "Deployment",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
hydrogen18 commented 2 years ago

I thought I had looked at something similar in the past; let me see if I can find it.

hydrogen18 commented 2 years ago

Looks similar to akash-network/node#1480. Maybe this got reintroduced somehow?

boz commented 2 years ago

maybe the fix wasn't applied to the master branch previously

hydrogen18 commented 2 years ago

Reproduced on master branch, will have to track this down to see what is going on

boz commented 2 years ago

I mean, was it fixed on mainnet/main when mainnet/main was v0.14.x, and then lost when master was merged into mainnet/main for 0.16.x?

hydrogen18 commented 2 years ago

It's hitting this line. I think we need to filter out the pods that have failed (and the like) before trying to run the command:

https://github.com/ovrclk/akash/blob/master/provider/cluster/kube/client_exec.go#L100
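
Roughly something like this (a sketch against client-go, not the provider's actual code; the function name and selector handling are made up for illustration):

// Sketch: skip pods that are no longer running (e.g. Evicted pods are left in
// a terminal phase) and only pick a live replica as the exec target.
package kube

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pickRunningPod lists the pods matching the service's label selector and
// returns the first one in the Running phase, instead of blindly taking the
// first entry of the list.
func pickRunningPod(ctx context.Context, kc kubernetes.Interface, ns, selector string) (*corev1.Pod, error) {
	pods, err := kc.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	for i := range pods.Items {
		if pods.Items[i].Status.Phase == corev1.PodRunning {
			return &pods.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no running pod found for selector %q", selector)
}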

hydrogen18 commented 2 years ago

@arno01 oh yeah, now that I look at this, are you sure the pod restarts? When I do this locally while watching the Kubernetes cluster, the pod moves to "Completed". After a while the provider closes the lease because the containers aren't running.

The Kubernetes pod has a restart policy of Always, but apparently that doesn't mean what it sounds like:

$ kubectl get pod --namespace=cul2933lrothig1100l4s5ra710m53f6sol2mncvhht3m web-77db64bfd-cn8jk  -o=jsonpath='{.spec.restartPolicy}' && echo
Always

I tried changing it to "OnFailure" (since "Never" seems like a poor choice), but that gives me this error:

E[2022-05-04|13:50:34.410] applying deployment                          module=provider-cluster-kube err="Deployment.apps \"bew\" is invalid: spec.template.spec.restartPolicy: Unsupported value: \"OnFailure\": supported values: \"Always\"" lease=akash178ctpsxaa4fcyq0fwtds4qx2ha0maluwll87wx/12/1/1/akash1xglzcfu4g9her6xhz95fk78h9555qaxz70cf4s service=bew
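
For completeness, filtering on pod phase shows only the replicas that are actually live; the evicted pod is left in a terminal phase and drops out of the list (generic kubectl, using the namespace from the lease above):

$ kubectl get pod --namespace=cul2933lrothig1100l4s5ra710m53f6sol2mncvhht3m --field-selector=status.phase=Running -o wide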
hydrogen18 commented 2 years ago

@sacreman any suggestions here?

arno01 commented 2 years ago

@hydrogen18 I've tested this again just now:

TL;DR: it looks like the issue is isolated to a single provider, Europlots. I would consider closing this issue, but since you've also reproduced it, maybe you want to check a few more things?

Evidence (Lumen)

$ curl -s -k https://provider.mainnet-1.ca.aksh.pw:8443/version | jq 
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.5",
    "gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
    "gitTreeState": "clean",
    "buildDate": "2022-03-16T15:52:18Z",
    "goVersion": "go1.17.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-events.1.txt

$ akash provider lease-events > lease-events.1
$ cat lease-events.1 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Scheduled"          "Successfully assigned ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8/ssh-7c9bb88b9f-54fvx to k8s-node-9.mainnet-1.ca"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Pulling"            "Pulling image ""ubuntu:21.10"""
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Pulled"             "Successfully pulled image ""ubuntu:21.10"" in 3.38558492s"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Created"            "Created container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Started"            "Started container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Scheduled"          "Successfully assigned ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8/ssh-7c9bb88b9f-rzxbx to k8s-node-5.mainnet-1.ca"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Pulling"            "Pulling image ""ubuntu:21.10"""
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Pulled"             "Successfully pulled image ""ubuntu:21.10"" in 3.385080374s"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Created"            "Created container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Started"            "Started container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Warning"  "Evicted"            "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Killing"            "Stopping container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "ReplicaSet"  "ssh-7c9bb88b9f"                                           "Normal"   "SuccessfulCreate"   "Created pod: ssh-7c9bb88b9f-rzxbx"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "ReplicaSet"  "ssh-7c9bb88b9f"                                           "Normal"   "SuccessfulCreate"   "Created pod: ssh-7c9bb88b9f-54fvx"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Deployment"  "ssh"                                                      "Normal"   "ScalingReplicaSet"  "Scaled up replica set ssh-7c9bb88b9f to 1"
$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-rzxbx                                              1/1     Running     0               2m1s    10.233.109.137    k8s-node-5.mainnet-1.ca        <none>           <none>

$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-54fvx                                              0/1     ContainerCreating   0               9s      <none>            k8s-node-9.mainnet-1.ca        <none>           <none>
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-rzxbx                                              0/1     Completed           0               2m11s   10.233.109.137    k8s-node-5.mainnet-1.ca        <none>           <none>

$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-54fvx                                              1/1     Running     0               47s     10.233.99.87      k8s-node-9.mainnet-1.ca        <none>           <none>
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-rzxbx                                              0/1     Completed   0               2m49s   10.233.109.137    k8s-node-5.mainnet-1.ca        <none>           <none>

Evidence (Europlots)

$ curl -s -k https://provider.europlots.com:8443/version | jq 
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.5",
    "gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
    "gitTreeState": "clean",
    "buildDate": "2022-03-16T15:52:18Z",
    "goVersion": "go1.17.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-events.2.txt

$ akash provider lease-events > lease-events.2
$ cat lease-events.2 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Ingress"     "gi52llqkrh8u98i6m3j0udd95c.ingress.europlots.com"  "Normal"   "Sync"               "Scheduled for sync"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Scheduled"          "Successfully assigned e8eivkd2u9j2vcvp7jjjsgi3uc65on2sqro3td0bjpfro/ssh-6ff4cf85f-gt6t2 to node3"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Created"            "Created container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Started"            "Started container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Warning"  "Evicted"            "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Killing"            "Stopping container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Scheduled"          "Successfully assigned e8eivkd2u9j2vcvp7jjjsgi3uc65on2sqro3td0bjpfro/ssh-6ff4cf85f-zh4bk to node3"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Created"            "Created container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Started"            "Started container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "ReplicaSet"  "ssh-6ff4cf85f"                                     "Normal"   "SuccessfulCreate"   "Created pod: ssh-6ff4cf85f-gt6t2"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "ReplicaSet"  "ssh-6ff4cf85f"                                     "Normal"   "SuccessfulCreate"   "Created pod: ssh-6ff4cf85f-zh4bk"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Deployment"  "ssh"                                               "Normal"   "ScalingReplicaSet"  "Scaled up replica set ssh-6ff4cf85f to 1"

I've asked the provider for the kubectl get pods -A -o wide output, but he is away. Shortly before I asked him, he said that he has a deployment that is still Terminating: he was testing storage speed with a Chia deployment and closed the lease, but it is still running:

# kubectl get pods --all-namespaces
NAMESPACE                                       NAME                                       READY   STATUS        RESTARTS       AGE
...
...
rgvf9cu3vacspjp0o9hdn8q4hc1pno1k7g2u8mjgrh3ha   chia-65b7fc4d96-62px7                      1/1     Terminating   0              163m
rgvf9cu3vacspjp0o9hdn8q4hc1pno1k7g2u8mjgrh3ha   chia-65b7fc4d96-v85qr                      1/1     Running       0              39m

Given that the namespace is the same, there must be some issue on his side.

Evidence (Akash.Pro)

This is my provider

$ curl -s -k https://provider.akash.pro:8443/version | jq 
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.6",
    "gitCommit": "ad3338546da947756e8a88aa6822e9c11e7eac22",
    "gitTreeState": "clean",
    "buildDate": "2022-04-14T08:43:11Z",
    "goVersion": "go1.17.9",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-events.3.txt

$ cat lease-events.3 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Ingress"     "7efc7i47i9euj7laotatgtpt7c.ingress.akash.pro"  "Normal"   "Sync"               "Scheduled for sync"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Scheduled"          "Successfully assigned ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc/ssh-79cc8d4674-2hqht to node1"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Created"            "Created container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Started"            "Started container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Scheduled"          "Successfully assigned ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc/ssh-79cc8d4674-zszts to node1"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Created"            "Created container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Started"            "Started container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Warning"  "Evicted"            "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Killing"            "Stopping container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "ReplicaSet"  "ssh-79cc8d4674"                                "Normal"   "SuccessfulCreate"   "Created pod: ssh-79cc8d4674-zszts"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "ReplicaSet"  "ssh-79cc8d4674"                                "Normal"   "SuccessfulCreate"   "Created pod: ssh-79cc8d4674-2hqht"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Deployment"  "ssh"                                           "Normal"   "ScalingReplicaSet"  "Scaled up replica set ssh-79cc8d4674 to 1"
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-zszts                       1/1     Running   0          27s    10.233.90.30   node1   <none>           <none>
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-2hqht                       0/1     ContainerCreating   0          0s     <none>         node1   <none>           <none>
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-zszts                       0/1     Completed           0          33s    10.233.90.30   node1   <none>           <none>
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-2hqht                       1/1     Running     0          2s     10.233.90.31   node1   <none>           <none>
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-zszts                       0/1     Completed   0          35s    10.233.90.30   node1   <none>           <none>
tidrolpolelsef commented 2 years ago

I'm confused that we can't seem to reproduce this across all providers uniformly at this point. Do we know if there are any differences in configuration between those?

chandadharap commented 1 year ago

There is a workaround per @boz, so making this sev2.

andy108369 commented 1 year ago

There have been new findings in https://github.com/ovrclk/engineering/issues/538. Closing in favor of that one.