I've tried a light image of mine with `sleep 10; exit 1;` and 50Gi of storage requested; the provider seems to be running fine.
@anilmurty pointed out this log line, which suggests that the size of the image is causing the issue; I'm now more inclined to believe that is the main reason:
Failed to garbage collect required amount of images. Attempted to free 5118015897 bytes, but only found 80233010 bytes eligible to free.
as the image Foundry tried is enormous (21.7GB):
$ docker images | grep zjuuu/comfyui
zjuuu/comfyui 0.6 ce0a48d3f9c2 2 weeks ago 21.7GB
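For anyone hitting the same thing: a quick way to confirm it really is the image filesystem filling up is to check disk usage directly on the node. The commands below assume a containerd-based node (which is what this cluster runs); the containerd path is the default one and may differ per setup:

```
# rough free-space check on the filesystem backing containerd (path may differ per setup)
df -h /var/lib/containerd

# image filesystem usage as the kubelet/CRI sees it
sudo crictl imagefsinfo

# list cached images with their sizes to spot the big ones
sudo crictl images
```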
I couldn't reproduce the node disk pressure when running `zjuuu/comfyui:0.6` with `sleep 10; exit 1;`; however, I could reproduce it when I used the default entrypoint of that image.
I guess the entrypoint is doing something that triggers the issue; I'm going to investigate deeper.
The behavior is quite different, depending on whether the image was cached or being pulled.
1) image was not cached, sleep 10 & exit 1 entrypoint - disk pressure, pod evicted;
Running the image `zjuuu/comfyui:0.6` with `sleep 10; exit 1;` today pulled the image again (seen from `lease-events`), knocking the node off after some time due to the disk pressure.
2) image is cached, sleep 10 & exit 1 entrypoint - no disk pressure;
Re-running the image `zjuuu/comfyui:0.6` with `sleep 10; exit 1;` did not pull the image again (no `Pulling image` event in `lease-events`), and the node was not knocked off.
3) image is cached, sleep infinity entrypoint and then bash -x /entrypoint.sh (the original entrypoint) executed manually - no disk pressure, pod dies:
Additionally, running the image `zjuuu/comfyui:0.6` with `sleep infinity` and then manually executing `bash -x /entrypoint.sh` (the standard entrypoint of that image) breaks the pod, which then gets its lease terminated (likely when the provider reaches `monitorMaxRetries`, which takes about 4-5 minutes).
The difference from the original behavior is that the pod is supposed to be restarted in `sleep infinity` mode (as per the SDL), which is not happening.
It is possible that it spawns another replica/pod which gets indefinitely stuck in "Pending". This would be the case if the Foundry provider didn't implement the Force New ReplicaSet Workaround (https://docs.akash.network/providers/akash-provider-troubleshooting/force-new-replicaset-workaround), which would otherwise alleviate this situation. I'll ask them to verify/apply it today.
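I haven't re-verified the exact mechanism the linked workaround uses, but the general idea is to adjust the lease Deployment's rollout strategy so a dead pod doesn't leave its replacement stuck in Pending. A hypothetical sketch of such a patch (namespace and deployment names are placeholders, not commands copied from the doc):

```
# hypothetical: let the old replica be torn down instead of surging an extra pod
# that can never be scheduled while the node is under pressure
kubectl -n <lease-namespace> patch deployment <service-name> --type merge -p \
  '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":0,"maxUnavailable":"100%"}}}}'
```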
SDL - original with slight modifications, primarily:
command:
- "sh"
- "-c"
args:
- 'sleep infinity; exit 1;'
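For completeness, roughly where that override sits in the SDL. This is only an illustrative minimal service/profile stanza (image, port, and resource figures taken from the pod spec further down in this issue), not the exact SDL that was used:

```yaml
services:
  comfy2:
    image: zjuuu/comfyui:0.6
    command:
      - "sh"
      - "-c"
    args:
      - 'sleep infinity; exit 1;'
    expose:
      - port: 8080
        as: 80
        to:
          - global: true

profiles:
  compute:
    comfy2:
      resources:
        cpu:
          units: 6
        memory:
          size: 35Gi
        storage:
          size: 50Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
```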
`lease-shell` into the deployment:
fcb@ssh-7ffd9dd59c-59b5x:/$ bash -x /entrypoint.sh
...
Exit code 137 means the container was killed with SIGKILL (128 + 9); this typically happens when a container or pod is terminated due to high memory usage.
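For reference, 137 is just the 128 + signal-number convention (SIGKILL is 9), which is easy to reproduce in a shell:

```
$ sh -c 'sleep 30 & kill -9 $!; wait $!'; echo $?
137
```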
After which:
- `lease-shell` returns `kube: lease not found`;
- `lease-events` returns no output;
- blockchain PoV: market {order, bid, lease} and the deployment are all active & open;
- when akash-provider reaches `monitorMaxRetries`, it closes the lease; the client then gets `Error: no active leases found for dseq=13480440` as expected.
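The on-chain side can be double-checked with the standard akash CLI queries, something along these lines (the owner address is a placeholder; the dseq is the one from this lease):

```
# deployment and market objects for this dseq as recorded on chain
akash query deployment get --owner <owner-address> --dseq 13480440
akash query market lease list --owner <owner-address> --dseq 13480440
```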
[13480440-1-1]$ akash_shell bash
Detected provider for 13480440/1/1: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
provider error messsage:
kube: lease not found
Error: remote server returned 400
[13480440-1-1]$ akash_status
Detected provider for 13480440/1/1: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
provider error messsage:
kube: lease not found
$ provider_info.sh provider.akash.foundrystaking.com
type cpu gpu ram ephemeral persistent
used 6.5 1 35.5 50.5 0
pending 6 1 35 50 0 <<<<<< my deployment (`zjuuu/comfyui:0.6`)
available 638.35 53 4204.639930725098 4297.102982298471 0
node 95.125 8 753.6953830718994 790.7263174308464 N/A
node 89.625 6 719.1953601837158 741.2263174308464 N/A
node 95.425 8 503.05484199523926 791.2263174308464 N/A
node 94.85 8 502.97010040283203 791.2263174308464 N/A
node 73 7 468.37353897094727 38.08480133768171 N/A <<<<< ephemeral storage was 88Gi before the (`zjuuu/comfyui:0.6`) deployment; as well as 79 CPU, 8 GPU, 503Gi RAM
node 94.9 8 503.2803649902344 353.38659380655736 N/A
node 95.425 8 754.0703411102295 791.2263174308464 N/A
Before the deployment:
node 79 8 503.37353897094727 88.08480133768171 N/A
4) image was not cached, sleep infinity - disk pressure, pod evicted;
Running the image `zjuuu/comfyui:0.6` with `sleep infinity`: after a few hours it was pulling the image again (seen from `lease-events`), knocking the deployment off after some time due to the disk pressure. The provider eventually closed its lease as expected.
So the whole problem is the large image combined with low ephemeral storage space (nodefs, imagefs).
There are certain thresholds which can be tweaked: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
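For providers that want to tune this, the relevant knobs are the kubelet's hard eviction thresholds and image GC percentages, set in the kubelet config file. The values below are just the upstream defaults, shown for illustration rather than as a recommendation:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# hard eviction thresholds; these are the upstream defaults
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
# image garbage collection runs when disk usage exceeds the high threshold
# and tries to bring it back down to the low threshold
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
```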
The pods have otherwise been running well on that node; i.e. the node itself wasn't getting knocked out permanently, only the deployment was being evicted. The node was disappearing from the akash-provider report (8443/status) for a short time while it was garbage-collecting.
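One way to watch this is to poll the provider status endpoint and see the node's inventory drop out and come back; roughly like this (the exact JSON layout of the /status response may differ between provider versions):

```
# watch the provider-reported inventory; the affected node drops out while GC runs
watch -n 10 'curl -sk https://provider.akash.foundrystaking.com:8443/status | jq ".cluster.inventory"'
```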
I guess we are good then.
The pods have been running on `prd-stk-tsr-dgx-41` throughout the entire time of this issue:
ubuntu@prd-stk-tsr-dgx-41:~$ sudo kubectl get pods -A -o wide | grep prd-stk-tsr-dgx-41
ingress-nginx ingress-nginx-controller-rs88r 1/1 Running 0 21h 10.233.84.232 prd-stk-tsr-dgx-41 <none> <none>
kube-system calico-node-hwbjd 1/1 Running 0 7d21h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system coredns-5c469774b8-vz7z6 1/1 Running 0 7d21h 10.233.84.193 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-apiserver-prd-stk-tsr-dgx-41 1/1 Running 1 7d21h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-controller-manager-prd-stk-tsr-dgx-41 1/1 Running 2 7d21h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-proxy-pxcm9 1/1 Running 0 6d19h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-scheduler-prd-stk-tsr-dgx-41 1/1 Running 1 7d21h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system nodelocaldns-5gwlp 1/1 Running 0 7d21h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
nvidia-device-plugin nvdp-nvidia-device-plugin-f457b 1/1 Running 0 7d 10.233.84.199 prd-stk-tsr-dgx-41 <none> <none>
prometheus prometheus-prometheus-node-exporter-rkcpv 1/1 Running 0 21h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
It appears the provider is running a chaperone utility which kills certain deployments. Their `file_types =` setting included `ssh` and `sshd` (I'm not sharing the entire config here for obvious reasons). And since I've been using the sshd-based deployment, it was getting killed.
Logs from chaperone:
Oct 31 11:47:51 prd-stk-tsr-dgx-41 python3[1150795]: We found a process running with ssh match in pod ssh-548995d65d-clh4q
Oct 31 11:47:51 prd-stk-tsr-dgx-41 python3[1150795]: We found a process running with sshd match in pod ssh-548995d65d-clh4q
Oct 31 11:47:51 prd-stk-tsr-dgx-41 python3[1150795]: We found a process running with chm match in pod ssh-548995d65d-clh4q
Oct 31 11:47:51 prd-stk-tsr-dgx-41 python3[1150795]: Deleted namespace: 3ad5plmb0b7ivmob1dris8ikptv9ok2443kjo2oas6l5m
Going to re-test the 4th scenario.
5) image was not cached, sleep infinity - disk pressure, pod evicted; (same as 4, except `chaperone` isn't killing the pod now)
The pod `comfy2-58d5c7b44d-558f4` failed due to:
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 15763389861, available: 12139364Ki.
and the node `prd-stk-tsr-dgx-41` had `KubeletHasDiskPressure`.
The provider closed the lease.
FWIW: the provider wasn't accessible initially because it was missing the haproxy rule to redirect 8443 (akash-provider) to the node it has been running on.
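For reference, the kind of rule meant here is a plain TCP passthrough of port 8443 to whichever node the akash-provider pod runs on. An illustrative haproxy snippet for an assumed setup (the node IP is the one from the node output in this issue; names are placeholders):

```
frontend akash-provider
    mode tcp
    bind :8443
    default_backend akash-provider-node

backend akash-provider-node
    mode tcp
    server dgx-41 10.40.160.141:8443 check
```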
fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6 comfy2-58d5c7b44d-558f4 0/1 Error 0 20m
fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6 comfy2-58d5c7b44d-svl82 0/1 Pending 0 94s
ubuntu@prd-stk-tsr-dgx-41:~$ sudo kubectl get pods -A -o wide | grep prd-stk-tsr-dgx-41
fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6 comfy2-58d5c7b44d-558f4 0/1 Error 0 21m 10.233.84.239 prd-stk-tsr-dgx-41 <none> <none>
ingress-nginx ingress-nginx-controller-qcrn9 0/1 Evicted 0 83s <none> prd-stk-tsr-dgx-41 <none> <none>
kube-system calico-node-hwbjd 1/1 Running 0 7d23h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system coredns-5c469774b8-vz7z6 1/1 Running 0 7d23h 10.233.84.193 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-apiserver-prd-stk-tsr-dgx-41 1/1 Running 1 7d23h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-controller-manager-prd-stk-tsr-dgx-41 1/1 Running 2 7d23h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-proxy-sxkd7 1/1 Running 0 150m 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system kube-scheduler-prd-stk-tsr-dgx-41 1/1 Running 1 7d23h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
kube-system nodelocaldns-5gwlp 1/1 Running 0 7d23h 10.40.160.141 prd-stk-tsr-dgx-41 <none> <none>
nvidia-device-plugin nvdp-nvidia-device-plugin-f457b 1/1 Running 0 7d2h 10.233.84.199 prd-stk-tsr-dgx-41 <none> <none>
prometheus prometheus-prometheus-node-exporter-5xlxh 0/1 Evicted 0 80s <none> prd-stk-tsr-dgx-41 <none> <none>
ubuntu@prd-stk-tsr-dgx-41:~$ sudo kubectl describe node prd-stk-tsr-dgx-41
Name: prd-stk-tsr-dgx-41
Roles: control-plane
Labels: akash.network/capabilities.gpu.vendor.nvidia.model.v100=true
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=prd-stk-tsr-dgx-41
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.40.160.141/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.233.84.192
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 23 Oct 2023 15:28:48 -0400
Taints: node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: prd-stk-tsr-dgx-41
AcquireTime: <unset>
RenewTime: Tue, 31 Oct 2023 15:13:05 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Mon, 23 Oct 2023 15:30:45 -0400 Mon, 23 Oct 2023 15:30:45 -0400 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 31 Oct 2023 15:13:12 -0400 Mon, 23 Oct 2023 15:28:47 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Tue, 31 Oct 2023 15:13:12 -0400 Tue, 31 Oct 2023 15:05:02 -0400 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Tue, 31 Oct 2023 15:13:12 -0400 Mon, 23 Oct 2023 15:28:47 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 31 Oct 2023 15:13:12 -0400 Mon, 23 Oct 2023 15:32:02 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.40.160.141
Hostname: prd-stk-tsr-dgx-41
Capacity:
cpu: 80
ephemeral-storage: 102626232Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 528225832Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 80
ephemeral-storage: 94580335255
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 528123432Ki
nvidia.com/gpu: 8
pods: 110
System Info:
Machine ID: 437e807e8fe84f478c670855995fba64
System UUID: 81628b67-a46e-e811-ab21-d8c49769155b
Boot ID: e6730582-abb0-4afd-bd01-2771350a1782
Kernel Version: 5.15.0-84-generic
OS Image: Ubuntu 22.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.5
Kubelet Version: v1.27.5
Kube-Proxy Version: v1.27.5
PodCIDR: 10.233.64.0/24
PodCIDRs: 10.233.64.0/24
Non-terminated Pods: (8 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-hwbjd 150m (0%) 300m (0%) 64M (0%) 500M (0%) 7d23h
kube-system coredns-5c469774b8-vz7z6 100m (0%) 0 (0%) 70Mi (0%) 300Mi (0%) 7d23h
kube-system kube-apiserver-prd-stk-tsr-dgx-41 250m (0%) 0 (0%) 0 (0%) 0 (0%) 7d23h
kube-system kube-controller-manager-prd-stk-tsr-dgx-41 200m (0%) 0 (0%) 0 (0%) 0 (0%) 7d23h
kube-system kube-proxy-sxkd7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 150m
kube-system kube-scheduler-prd-stk-tsr-dgx-41 100m (0%) 0 (0%) 0 (0%) 0 (0%) 7d23h
kube-system nodelocaldns-5gwlp 100m (0%) 0 (0%) 70Mi (0%) 200Mi (0%) 7d23h
nvidia-device-plugin nvdp-nvidia-device-plugin-f457b 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d2h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 900m (1%) 300m (0%)
memory 210800640 (0%) 1024288k (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeHasNoDiskPressure 13m (x18 over 7d23h) kubelet Node prd-stk-tsr-dgx-41 status is now: NodeHasNoDiskPressure
Normal NodeHasDiskPressure 8m12s (x18 over 3d6h) kubelet Node prd-stk-tsr-dgx-41 status is now: NodeHasDiskPressure
Warning FreeDiskSpaceFailed 6m19s kubelet Failed to garbage collect required amount of images. Attempted to free 9002490265 bytes, but only found 0 bytes eligible to free.
Warning EvictionThresholdMet 3m15s (x54 over 3d6h) kubelet Attempting to reclaim ephemeral-storage
ubuntu@prd-stk-tsr-dgx-41:~$ sudo kubectl -n fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6 describe pod comfy2-58d5c7b44d-558f4
Name: comfy2-58d5c7b44d-558f4
Namespace: fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: prd-stk-tsr-dgx-41/10.40.160.141
Start Time: Tue, 31 Oct 2023 14:50:53 -0400
Labels: akash.network=true
akash.network/manifest-service=comfy2
akash.network/namespace=fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6
pod-template-hash=58d5c7b44d
Annotations: cni.projectcalico.org/containerID: 43bef3f4ef7fbe33e7c178e973a1fa03d631232e4efa998dba09d63837e07369
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 15763389861, available: 12139364Ki.
IP: 10.233.84.239
IPs:
IP: 10.233.84.239
Controlled By: ReplicaSet/comfy2-58d5c7b44d
Containers:
comfy2:
Container ID: containerd://9989c3a5b93e795a56f635d976a895ffd1b436831f39b2c3fa0a8ae51b387f1a
Image: zjuuu/comfyui:0.6
Image ID: docker.io/zjuuu/comfyui@sha256:7a6cf7c24e1c74b223b87fda62b47fa161a21333c072b49e02f09dd486626588
Port: 8080/TCP
Host Port: 0/TCP
Command:
sh
-c
Args:
sleep infinity; exit 1;
State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 31 Oct 2023 14:57:01 -0400
Finished: Tue, 31 Oct 2023 15:09:58 -0400
Ready: False
Restart Count: 0
Limits:
cpu: 6
ephemeral-storage: 53687091200
memory: 37580963840
nvidia.com/gpu: 1
Requests:
cpu: 6
ephemeral-storage: 53687091200
memory: 37580963840
nvidia.com/gpu: 1
Environment:
ENABLE_MANAGER: true
VAEURLS:
MODELURLS: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors,https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/sd_xl_refiner_1.0.safetensors
UPSCALEURLS:
COMMANDLINE_ARGS: --listen --port 8080
AKASH_GROUP_SEQUENCE: 1
AKASH_DEPLOYMENT_SEQUENCE: 13485628
AKASH_ORDER_SEQUENCE: 1
AKASH_OWNER: akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys
AKASH_PROVIDER: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
AKASH_CLUSTER_PUBLIC_HOSTNAME: provider.akash.foundrystaking.com
Mounts: <none>
Conditions:
Type Status
DisruptionTarget True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes: <none>
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6/comfy2-58d5c7b44d-558f4 to prd-stk-tsr-dgx-41
Normal Pulling 22m kubelet Pulling image "zjuuu/comfyui:0.6"
Normal Pulled 16m kubelet Successfully pulled image "zjuuu/comfyui:0.6" in 6m7.123648259s (6m7.12366841s including waiting)
Normal Created 16m kubelet Created container comfy2
Normal Started 16m kubelet Started container comfy2
Warning Evicted 4m3s kubelet The node was low on resource: ephemeral-storage. Threshold quantity: 15763389861, available: 12139364Ki.
Normal Killing 4m3s kubelet Stopping container comfy2
Warning ExceededGracePeriod 3m53s kubelet Container runtime did not kill the pod within specified grace period.
ubuntu@prd-stk-tsr-dgx-41:~$ sudo kubectl -n fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6 describe pod comfy2-58d5c7b44d-svl82
Name: comfy2-58d5c7b44d-svl82
Namespace: fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: <none>
Labels: akash.network=true
akash.network/manifest-service=comfy2
akash.network/namespace=fns3s2v3eu25jps6bvr42de873t3iun1vngci22uf9ru6
pod-template-hash=58d5c7b44d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/comfy2-58d5c7b44d
Containers:
comfy2:
Image: zjuuu/comfyui:0.6
Port: 8080/TCP
Host Port: 0/TCP
Command:
sh
-c
Args:
sleep infinity; exit 1;
Limits:
cpu: 6
ephemeral-storage: 53687091200
memory: 37580963840
nvidia.com/gpu: 1
Requests:
cpu: 6
ephemeral-storage: 53687091200
memory: 37580963840
nvidia.com/gpu: 1
Environment:
ENABLE_MANAGER: true
VAEURLS:
MODELURLS: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors,https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/sd_xl_refiner_1.0.safetensors
UPSCALEURLS:
COMMANDLINE_ARGS: --listen --port 8080
AKASH_GROUP_SEQUENCE: 1
AKASH_DEPLOYMENT_SEQUENCE: 13485628
AKASH_ORDER_SEQUENCE: 1
AKASH_OWNER: akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys
AKASH_PROVIDER: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
AKASH_CLUSTER_PUBLIC_HOSTNAME: provider.akash.foundrystaking.com
Mounts: <none>
Conditions:
Type Status
PodScheduled False
Volumes: <none>
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m48s default-scheduler 0/8 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }, 7 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
Looks like the provider increased the disk space, so the SDL can be retried there again.
$ provider_info.sh provider.akash.foundrystaking.com
type cpu gpu ram ephemeral persistent
used 10.5 1 39.5 54.5 0
pending 0 0 0 0 0
available 704.725 46 4742.572359085083 12225.30169552099 0
node 89.625 6 719.1953601837158 741.2263174308464 N/A
node 93.625 8 501.17984199523926 789.2263174308464 N/A
node 94.85 8 502.97010040283203 791.2263174308464 N/A
node 79 8 503.37351989746094 6384.962164035067 N/A
node 252.725 8 2012.5731716156006 3165.273985386826 N/A
node 94.9 8 503.2803649902344 353.38659380655736 N/A
Foundry encountered an issue where a node with 88Gi of available ephemeral disk space was experiencing NodeHasDiskPressure. This is evident from the events logged by the kubelet:
Correction: the node isn't getting fully evicted, but it is getting flagged as being in poor condition and the pods on it are getting evicted:
The issue was linked to a particular deployment, the contents of which can be found at this link. Due to continuous failures, Kubernetes kept trying to restart the deployment:
To recreate the issue, a simple SDL was used:
The available disk space decreased by twice the requested amount every 10 seconds, with the pod status alternating between Error and Pending. This indicated that Kubernetes was continually trying to restart the pod.
Upon checking the resource consumption before and after submitting the SDL, it was found that the CPU, memory, and storage were all being consumed at twice the rate requested by the deployment. Before the SDL submission, the resource status was:
After the SDL submission:
A few seconds later:
When the deployment started to crash and redeploy:
This resulted in unexpected resource consumption, with the node reporting nearly double the resources in use compared to what it was supposed to:
Logs from Foundry:
And the provider goes offline too (since the akash-provider pod gets evicted from that node as well).