Closed arno01 closed 1 year ago
I thought I had looked at something similar in the past, let me see if I can find it
Looks similar to issue akash-network/node#1480 . Maybe this got reintroduced somehow?
maybe the fix wasn't applied to the master branch previously
Reproduced on master branch, will have to track this down to see what is going on
I mean: was it fixed on mainnet/main when mainnet/main was at v0.14.x, then lost when master was merged into mainnet/main for 0.16.x?
It's hitting this line. I think we need to filter out the pods that have failed (and the like) before trying to run the command:
https://github.com/ovrclk/akash/blob/master/provider/cluster/kube/client_exec.go#L100
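As a sketch of that filter (not the actual Go change): only pods whose `status.phase` is `Running` should be considered as exec targets. The same predicate, expressed with the jq tooling already used in this thread, over a made-up pod list:

```shell
#!/bin/sh
# Sketch only: keep Running pods before choosing an exec target. The JSON
# below is a made-up stand-in for the kubectl/API pod list; the real fix
# would apply the same status.phase check in client_exec.go.
pods_json='{"items":[
  {"metadata":{"name":"ssh-evicted"},"status":{"phase":"Failed"}},
  {"metadata":{"name":"ssh-live"},"status":{"phase":"Running"}}
]}'
echo "$pods_json" \
  | jq -r '.items[] | select(.status.phase == "Running") | .metadata.name'
# prints: ssh-live
```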
@arno01 oh yeah, now that I look at this, are you sure the pod restarts? When I do this locally while watching the kubernetes cluster, the pod moves to "Completed". After a while the provider closes the lease because the containers aren't running.
The Kubernetes pod has a restart policy of Always, but apparently that doesn't mean anything of the sort:
$ kubectl get pod --namespace=cul2933lrothig1100l4s5ra710m53f6sol2mncvhht3m web-77db64bfd-cn8jk -o=jsonpath='{.spec.restartPolicy}' && echo
Always
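If I understand the semantics right, `restartPolicy: Always` only governs restarting containers inside a pod that is still scheduled; an evicted pod is terminated for good, and it's the ReplicaSet that creates a replacement pod. The evicted pod's own status records the terminal state, which can be pulled out the same way (sample JSON below stands in for real `kubectl get pod <pod> -o json` output on an evicted pod):

```shell
#!/bin/sh
# The evicted pod keeps a terminal status; restartPolicy does not revive it.
# Sample status, standing in for `kubectl get pod <pod> -o json` output:
evicted='{"status":{"phase":"Failed","reason":"Evicted"}}'
echo "$evicted" | jq -r '.status | "\(.phase)/\(.reason)"'
# prints: Failed/Evicted
```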
I tried changing it to "OnFailure" (since "Never" seems like a poor choice), but that gives me this error:
E[2022-05-04|13:50:34.410] applying deployment module=provider-cluster-kube err="Deployment.apps \"bew\" is invalid: spec.template.spec.restartPolicy: Unsupported value: \"OnFailure\": supported values: \"Always\"" lease=akash178ctpsxaa4fcyq0fwtds4qx2ha0maluwll87wx/12/1/1/akash1xglzcfu4g9her6xhz95fk78h9555qaxz70cf4s service=bew
@sacreman any suggestions here?
@hydrogen18 I've tested this again just now:
TL;DR: Looks like the issue is isolated to a single provider, Europlots. I would consider closing this issue, but since you've also reproduced it, maybe you want to check a few more things?
$ akash provider lease-shell --tty --stdin -- ssh bash
Error: lease shell failed: remote command execute error: service with that name is not running: the service has failed
The :8443/version reports are the same (1:1):
$ curl -s -k https://provider.mainnet-1.ca.aksh.pw:8443/version | jq
{
"akash": {
"version": "v0.16.4-rc0",
"commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
"buildTags": "\"osusergo,netgo,ledger,static_build\"",
"go": "go version go1.17.6 linux/amd64",
"cosmosSdkVersion": "v0.45.1"
},
"kube": {
"major": "1",
"minor": "23",
"gitVersion": "v1.23.5",
"gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
"gitTreeState": "clean",
"buildDate": "2022-03-16T15:52:18Z",
"goVersion": "go1.17.8",
"compiler": "gc",
"platform": "linux/amd64"
}
}
$ akash provider lease-events > lease-events.1
$ cat lease-events.1 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Ingress" "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw" "Normal" "Sync" "Scheduled for sync"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-54fvx" "Normal" "Scheduled" "Successfully assigned ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8/ssh-7c9bb88b9f-54fvx to k8s-node-9.mainnet-1.ca"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-54fvx" "Normal" "Pulling" "Pulling image ""ubuntu:21.10"""
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-54fvx" "Normal" "Pulled" "Successfully pulled image ""ubuntu:21.10"" in 3.38558492s"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-54fvx" "Normal" "Created" "Created container ssh"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-54fvx" "Normal" "Started" "Started container ssh"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Normal" "Scheduled" "Successfully assigned ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8/ssh-7c9bb88b9f-rzxbx to k8s-node-5.mainnet-1.ca"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Normal" "Pulling" "Pulling image ""ubuntu:21.10"""
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Normal" "Pulled" "Successfully pulled image ""ubuntu:21.10"" in 3.385080374s"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Normal" "Created" "Created container ssh"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Normal" "Started" "Started container ssh"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Warning" "Evicted" "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Pod" "ssh-7c9bb88b9f-rzxbx" "Normal" "Killing" "Stopping container ssh"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "ReplicaSet" "ssh-7c9bb88b9f" "Normal" "SuccessfulCreate" "Created pod: ssh-7c9bb88b9f-rzxbx"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "ReplicaSet" "ssh-7c9bb88b9f" "Normal" "SuccessfulCreate" "Created pod: ssh-7c9bb88b9f-54fvx"
5823330 1 1 "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf" "Deployment" "ssh" "Normal" "ScalingReplicaSet" "Scaled up replica set ssh-7c9bb88b9f to 1"
$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8 ssh-7c9bb88b9f-rzxbx 1/1 Running 0 2m1s 10.233.109.137 k8s-node-5.mainnet-1.ca <none> <none>
$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8 ssh-7c9bb88b9f-54fvx 0/1 ContainerCreating 0 9s <none> k8s-node-9.mainnet-1.ca <none> <none>
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8 ssh-7c9bb88b9f-rzxbx 0/1 Completed 0 2m11s 10.233.109.137 k8s-node-5.mainnet-1.ca <none> <none>
$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8 ssh-7c9bb88b9f-54fvx 1/1 Running 0 47s 10.233.99.87 k8s-node-9.mainnet-1.ca <none> <none>
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8 ssh-7c9bb88b9f-rzxbx 0/1 Completed 0 2m49s 10.233.109.137 k8s-node-5.mainnet-1.ca <none> <none>
$ curl -s -k https://provider.europlots.com:8443/version | jq
{
"akash": {
"version": "v0.16.4-rc0",
"commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
"buildTags": "\"osusergo,netgo,ledger,static_build\"",
"go": "go version go1.17.6 linux/amd64",
"cosmosSdkVersion": "v0.45.1"
},
"kube": {
"major": "1",
"minor": "23",
"gitVersion": "v1.23.5",
"gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
"gitTreeState": "clean",
"buildDate": "2022-03-16T15:52:18Z",
"goVersion": "go1.17.8",
"compiler": "gc",
"platform": "linux/amd64"
}
}
$ akash provider lease-events > lease-events.2
$ cat lease-events.2 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Ingress" "gi52llqkrh8u98i6m3j0udd95c.ingress.europlots.com" "Normal" "Sync" "Scheduled for sync"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-gt6t2" "Normal" "Scheduled" "Successfully assigned e8eivkd2u9j2vcvp7jjjsgi3uc65on2sqro3td0bjpfro/ssh-6ff4cf85f-gt6t2 to node3"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-gt6t2" "Normal" "Pulled" "Container image ""ubuntu:21.10"" already present on machine"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-gt6t2" "Normal" "Created" "Created container ssh"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-gt6t2" "Normal" "Started" "Started container ssh"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-gt6t2" "Warning" "Evicted" "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-gt6t2" "Normal" "Killing" "Stopping container ssh"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-zh4bk" "Normal" "Scheduled" "Successfully assigned e8eivkd2u9j2vcvp7jjjsgi3uc65on2sqro3td0bjpfro/ssh-6ff4cf85f-zh4bk to node3"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-zh4bk" "Normal" "Pulled" "Container image ""ubuntu:21.10"" already present on machine"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-zh4bk" "Normal" "Created" "Created container ssh"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Pod" "ssh-6ff4cf85f-zh4bk" "Normal" "Started" "Started container ssh"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "ReplicaSet" "ssh-6ff4cf85f" "Normal" "SuccessfulCreate" "Created pod: ssh-6ff4cf85f-gt6t2"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "ReplicaSet" "ssh-6ff4cf85f" "Normal" "SuccessfulCreate" "Created pod: ssh-6ff4cf85f-zh4bk"
5823531 1 1 "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc" "Deployment" "ssh" "Normal" "ScalingReplicaSet" "Scaled up replica set ssh-6ff4cf85f to 1"
I've asked the provider for the kubectl get pods -A -o wide output, but he is away.
Shortly before I asked him, he said he's got some deployment that is still Terminating.
He was testing storage speed with a Chia deployment and closed the lease, but it is still running:
# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
...
...
rgvf9cu3vacspjp0o9hdn8q4hc1pno1k7g2u8mjgrh3ha chia-65b7fc4d96-62px7 1/1 Terminating 0 163m
rgvf9cu3vacspjp0o9hdn8q4hc1pno1k7g2u8mjgrh3ha chia-65b7fc4d96-v85qr 1/1 Running 0 39m
Given that the namespace is the same, there must be some issue on his side.
This is my provider:
$ curl -s -k https://provider.akash.pro:8443/version | jq
{
"akash": {
"version": "v0.16.4-rc0",
"commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
"buildTags": "\"osusergo,netgo,ledger,static_build\"",
"go": "go version go1.17.6 linux/amd64",
"cosmosSdkVersion": "v0.45.1"
},
"kube": {
"major": "1",
"minor": "23",
"gitVersion": "v1.23.6",
"gitCommit": "ad3338546da947756e8a88aa6822e9c11e7eac22",
"gitTreeState": "clean",
"buildDate": "2022-04-14T08:43:11Z",
"goVersion": "go1.17.9",
"compiler": "gc",
"platform": "linux/amd64"
}
}
$ cat lease-events.3 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Ingress" "7efc7i47i9euj7laotatgtpt7c.ingress.akash.pro" "Normal" "Sync" "Scheduled for sync"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-2hqht" "Normal" "Scheduled" "Successfully assigned ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc/ssh-79cc8d4674-2hqht to node1"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-2hqht" "Normal" "Pulled" "Container image ""ubuntu:21.10"" already present on machine"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-2hqht" "Normal" "Created" "Created container ssh"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-2hqht" "Normal" "Started" "Started container ssh"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-zszts" "Normal" "Scheduled" "Successfully assigned ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc/ssh-79cc8d4674-zszts to node1"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-zszts" "Normal" "Pulled" "Container image ""ubuntu:21.10"" already present on machine"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-zszts" "Normal" "Created" "Created container ssh"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-zszts" "Normal" "Started" "Started container ssh"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-zszts" "Warning" "Evicted" "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Pod" "ssh-79cc8d4674-zszts" "Normal" "Killing" "Stopping container ssh"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "ReplicaSet" "ssh-79cc8d4674" "Normal" "SuccessfulCreate" "Created pod: ssh-79cc8d4674-zszts"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "ReplicaSet" "ssh-79cc8d4674" "Normal" "SuccessfulCreate" "Created pod: ssh-79cc8d4674-2hqht"
5823715 1 1 "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0" "Deployment" "ssh" "Normal" "ScalingReplicaSet" "Scaled up replica set ssh-79cc8d4674 to 1"
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc ssh-79cc8d4674-zszts 1/1 Running 0 27s 10.233.90.30 node1 <none> <none>
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc ssh-79cc8d4674-2hqht 0/1 ContainerCreating 0 0s <none> node1 <none> <none>
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc ssh-79cc8d4674-zszts 0/1 Completed 0 33s 10.233.90.30 node1 <none> <none>
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc ssh-79cc8d4674-2hqht 1/1 Running 0 2s 10.233.90.31 node1 <none> <none>
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc ssh-79cc8d4674-zszts 0/1 Completed 0 35s 10.233.90.30 node1 <none> <none>
I'm confused that we can't seem to reproduce this across all providers uniformly at this point. Do we know if there are any differences in configuration between those?
There is a workaround per @boz, so making this sev2.
There have been new findings in https://github.com/ovrclk/engineering/issues/538 . Closing in favor of that one.
Reproducer

1. deploy something with a 1 or 10 GB storage request (== limit);
2. consume more than the limit set in step 1;
3. this will cause the pod to restart due to the eviction ("Container ssh exceeded its local ephemeral storage limit");
4. akash provider lease-shell stops working.

akash provider & client are of version 0.16.4-rc0.

See also: the lease-status after the eviction, and the lease-events logs.
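For the "consume more than the limit" step, one way to overshoot it from a shell inside the container (file path and helper name are just illustrative; anything that writes past the 1 GiB ephemeral-storage limit triggers the eviction):

```shell
#!/bin/sh
# fill_mib N: write N MiB of zeros to /tmp and print the resulting size.
# Call it with a value above the SDL storage limit, e.g. `fill_mib 2048`
# against a 1 GiB limit, and the kubelet will evict the pod.
fill_mib() {
  dd if=/dev/zero of=/tmp/fill bs=1M count="$1" 2>/dev/null
  stat -c %s /tmp/fill
}
```

Running fill_mib 2048 inside the ssh deployment above reproduces the Evicted event seen in the lease-events output.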