akash-network / support

Akash Support and Issue Tracking

NodeHasDiskPressure Causing Pod Evictions Due to Excessive Disk Usage #226

Open andy108369 opened 1 month ago

andy108369 commented 1 month ago

akash network: sandbox-01
akash network version: v0.34.0 (binary v0.34.1)
akash provider version: 0.6.1

Description

provider.provider-02.sandbox-01.aksh.pw has encountered NodeHasDiskPressure. The node ran out of available disk space, causing Kubernetes to evict pods to reclaim disk space.


Relevant events are as follows:

$ kubectl get events -A --sort-by='.lastTimestamp'
...
akash-services                                  3m43s       Warning   Evicted                 pod/akash-provider-0                                                        The node was low on resource: ephemeral-storage. Threshold quantity: 31189488855, available: 29362700Ki. Container provider was using 26060Ki, request is 0, has larger consumption of ephemeral-storage.
default                                         3m35s       Normal    NodeHasDiskPressure     node/node1                                                                  Node node1 status is now: NodeHasDiskPressure

More detailed events & logs can be found here.

This issue can arise with any deployment that writes a significant amount of data, leading to excessive disk usage. This causes the node to exceed the nodefs threshold, triggering Kubernetes to start the eviction process in an attempt to reclaim ephemeral storage:

$ kubectl get events -A --sort-by='.lastTimestamp' | grep reclaim
default                                         16m         Warning   EvictionThresholdMet    node/node1                                                                  Attempting to reclaim ephemeral-storage
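
For reference, whether the kubelet currently reports disk pressure on the node can be checked directly (node1 is the affected worker here):

$ kubectl get node node1 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'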

Additionally, it appears that the image size is not being taken into account when determining available space. For instance, a worker node might have 150GB of free disk space, allowing a tenant to claim that space for their deployment. However, if the image itself is large (e.g., 12GB), this can trigger the eviction threshold:

root@node1:~# crictl images | grep llama
docker.io/yuravorobei/llama-2                           0.6                 5122212d50e6a       12.1GB
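
As a rough way to gauge how much of the node's disk is consumed by images and container layers alone (assuming containerd with its default root of /var/lib/containerd), something like the following can be run on the node:

root@node1:~# du -sh /var/lib/containerd    # space held by pulled images + writable container layers
root@node1:~# crictl imagefsinfo            # imagefs usage as reported by the container runtime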

Disk usage on node1 is rapidly increasing, indicating a potential risk for further evictions:

root@node1:~# iotop
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
2506135 be/4 root        0.00 B/s  266.15 M/s  ?unavailable?  python /usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 7860
...

root@node1:~# df -Ph /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G  134G   60G  70% /
root@node1:~# df -Ph /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G  143G   52G  74% /
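
As a side note, the PID reported by iotop can usually be traced back to the offending container and pod via its cgroup path; a rough sketch, assuming a containerd-based worker (2506135 is the uvicorn process from the iotop output above):

root@node1:~# CID=$(grep -oE '[0-9a-f]{64}' /proc/2506135/cgroup | head -n1)    # container ID is embedded in the cgroup path
root@node1:~# crictl inspect "$CID" | grep -E '"io.kubernetes.pod.(name|namespace)"'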

Despite having resource limits in place:

$ kubectl -n $ns get deployment/app -o yaml
...
        resources:
          limits:
            cpu: "4"
            ephemeral-storage: "161061273600"
            memory: "16106127360"
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
            ephemeral-storage: "161061273600"
            memory: "16106127360"
            nvidia.com/gpu: "1"
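
Note that the limit above (161061273600 bytes, i.e. 150 GiB) is close to the node's entire 194 GB root filesystem, so the nodefs eviction threshold can be crossed well before the pod reaches its own ephemeral-storage limit. The actual per-pod usage can be compared against the declared limits through the kubelet Summary API; a sketch, assuming jq is available and node1 is the node name:

$ kubectl get --raw "/api/v1/nodes/node1/proxy/stats/summary" \
    | jq -r '.pods[] | [.podRef.namespace, .podRef.name, (.["ephemeral-storage"].usedBytes // 0)] | @tsv'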

The kubelet's hard-eviction thresholds for memory (RAM), nodefs (/var/lib/kubelet, i.e. ephemeral-storage), and imagefs (/var/lib/containerd) are as follows (these are the Kubernetes defaults):

memory.available<100Mi
nodefs.available<10%
imagefs.available<15%

Refer to the Kubernetes node-pressure eviction documentation for more details.
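
For completeness, the thresholds actually in effect on a node can be read from the kubelet's configz endpoint (a sketch, assuming jq is available; if the field comes back empty, the built-in defaults listed above apply):

$ kubectl get --raw "/api/v1/nodes/node1/proxy/configz" | jq '.kubeletconfig.evictionHard'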

Reproducer

  1. Deploy a large image (say, 5 GiB in size) requesting the maximum disk space available on the worker node. The available disk space can be read from the provider's 8443/status endpoint, which helps ensure the deployment lands on the intended node.
  2. Once deployed, SSH into the node (or use lease-shell) and start writing data. Use real data instead of zeroes so the writes actually consume disk space (df -Ph / can be checked from the pod or directly on the worker host); a minimal example is sketched after this list.
  3. Once all disk space is used up, let the system sit for 5-10 minutes so the kubelet notices the issue (NodeHasDiskPressure event) and starts evicting pods to reclaim ephemeral-storage (EvictionThresholdMet and Evicted events).
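
A minimal sketch of the write load for step 2, run inside the pod via lease-shell (the file path and size are arbitrary):

$ dd if=/dev/urandom of=/tmp/fill-01.bin bs=1M count=10240    # ~10 GiB of non-compressible data; repeat with new file names until the disk is full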

The issue

The K8s eviction manager starts evicting other pods, not just the culprit pod, most likely because it has no way of knowing which pod is the culprit.

Potential solution

A long-term solution could be to keep the imagefs (/var/lib/containerd) on a separate partition from the nodefs (/var/lib/kubelet).
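
A hedged sketch of what that separation could look like on a containerd-based worker (the device name /dev/vdb is illustrative; in practice this would be handled by the provider's provisioning tooling):

root@node1:~# systemctl stop kubelet containerd
root@node1:~# mkfs.ext4 /dev/vdb                                                # dedicated disk for the image filesystem
root@node1:~# mount /dev/vdb /mnt && cp -a /var/lib/containerd/. /mnt/ && umount /mnt
root@node1:~# echo '/dev/vdb /var/lib/containerd ext4 defaults 0 2' >> /etc/fstab
root@node1:~# mount /var/lib/containerd
root@node1:~# systemctl start containerd kubelet

With imagefs on its own partition, the kubelet tracks the nodefs and imagefs eviction signals independently (see the node-pressure eviction documentation referenced above), so image and container-layer usage no longer competes with /var/lib/kubelet for the same filesystem.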

Action required

Please investigate and implement measures that account for image size when determining available space, and manage disk usage more effectively to prevent future evictions.

Additional context

I've seen this issue before, when tenants attempted to deploy heavy container images (>10 GiB in size) to worker nodes that did not have enough free disk space.

Potentially related: https://github.com/akash-network/support/issues/138