bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Bottlerocket under-reports Ephemeral Storage Capacity #2743

Open jonathan-innis opened 1 year ago

jonathan-innis commented 1 year ago

Image I'm using:

AMI Name: bottlerocket-aws-k8s-1.22-aarch64-v1.11.1-104f8e0f

What I expected to happen:

I expected the ephemeral-storage capacity reported for my worker node to be approximately the size of the EBS volume attached at /dev/xvdb, which backs the node's data filesystem.

bash-5.1# df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        904M  557M  285M  67% /
devtmpfs          16G     0   16G   0% /dev
tmpfs             16G     0   16G   0% /dev/shm
tmpfs            6.1G  1.2M  6.1G   1% /run
tmpfs            4.0M     0  4.0M   0% /sys/fs/cgroup
tmpfs             16G  476K   16G   1% /etc
tmpfs             16G  4.0K   16G   1% /etc/cni
tmpfs             16G     0   16G   0% /tmp
tmpfs             16G  4.0K   16G   1% /etc/containerd
tmpfs             16G   12K   16G   1% /etc/host-containers
tmpfs             16G  4.0K   16G   1% /etc/kubernetes/pki
tmpfs             16G     0   16G   0% /root/.aws
/dev/nvme1n1p1   4.3T  1.6G  4.1T   1% /local
/dev/nvme0n1p12   36M  944K   32M   3% /var/lib/bottlerocket
overlay          4.3T  1.6G  4.1T   1% /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/modules
overlay          4.3T  1.6G  4.1T   1% /opt/cni/bin
/dev/loop1       384K  384K     0 100% /aarch64-bottlerocket-linux-gnu/sys-root/usr/share/licenses
/dev/loop0        12M   12M     0 100% /var/lib/kernel-devel/.overlay/lower
overlay          4.3T  1.6G  4.1T   1% /aarch64-bottlerocket-linux-gnu/sys-root/usr/src/kernels
bash-5.1# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0          7:0    0 11.6M  1 loop /var/lib/kernel-devel/.overlay/lower
loop1          7:1    0  292K  1 loop /aarch64-bottlerocket-linux-gnu/sys-root/usr/share/licenses
nvme0n1      259:0    0    2G  0 disk 
|-nvme0n1p1  259:2    0    4M  0 part 
|-nvme0n1p2  259:3    0    5M  0 part 
|-nvme0n1p3  259:4    0   40M  0 part /boot
|-nvme0n1p4  259:5    0  920M  0 part 
|-nvme0n1p5  259:6    0   10M  0 part 
|-nvme0n1p6  259:7    0   25M  0 part 
|-nvme0n1p7  259:8    0    5M  0 part 
|-nvme0n1p8  259:9    0   40M  0 part 
|-nvme0n1p9  259:10   0  920M  0 part 
|-nvme0n1p10 259:11   0   10M  0 part 
|-nvme0n1p11 259:12   0   25M  0 part 
`-nvme0n1p12 259:13   0   42M  0 part /var/lib/bottlerocket
nvme1n1      259:1    0  4.3T  0 disk 
`-nvme1n1p1  259:14   0  4.3T  0 part /var
                                      /opt
                                      /mnt
                                      /local

[Screenshot attached: Screen Shot 2023-01-19 at 11 13 03 AM]

What actually happened:

status:
  addresses:
  - address: 192.168.124.121
    type: InternalIP
  - address: ip-192-168-124-121.us-west-2.compute.internal
    type: Hostname
  - address: ip-192-168-124-121.us-west-2.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 15890m
    ephemeral-storage: "1342050565150"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 28738288Ki
    pods: "234"
    vpc.amazonaws.com/pod-eni: "54"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "16"
    ephemeral-storage: 1457383148Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 31737584Ki
    pods: "234"
    vpc.amazonaws.com/pod-eni: "54"

cAdvisor or something else in the Bottlerocket image appears to be under-reporting the storage capacity I have on this worker node.

The reported ephemeral-storage capacity here is 1457383148Ki ~= 1.36 TiB, which is nowhere near the ~4.3T that lsblk reports for the data volume.
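(For reference, the unit conversion can be sanity-checked with coreutils numfmt, if it's available; both lines below are just arithmetic on the value kubelet reports:)

numfmt --from=iec-i --to=iec-i 1457383148Ki    # ~1.4Ti (binary units)
numfmt --from=iec-i --to=si 1457383148Ki       # ~1.5T (decimal units)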

How to reproduce the problem:

  1. Launch an instance from the Bottlerocket AMI with an EBS volume larger than 4 TB attached as the /dev/xvdb data volume (see the example launch command below the steps)
  2. View the node's capacity once it connects and joins the cluster, using kubectl get node or kubectl describe node
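As a rough sketch (not the exact command used here), such an instance can be launched with the AWS CLI by mapping a large EBS volume to /dev/xvdb; the AMI ID, instance type, subnet, instance profile, and user-data file are all placeholders:

# All IDs below are placeholders; userdata.toml is the Bottlerocket settings TOML used to join the cluster.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m6g.4xlarge \
  --subnet-id subnet-0123456789abcdef0 \
  --iam-instance-profile Name=my-node-instance-profile \
  --user-data fileb://userdata.toml \
  --block-device-mappings '[{"DeviceName":"/dev/xvdb","Ebs":{"VolumeSize":4500,"VolumeType":"gp3"}}]'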
jpculp commented 1 year ago

@jonathan-innis, thanks for reaching out! We're taking a deeper look into this.

jonathan-innis commented 1 year ago

@jpculp Is there any progress or updates on this issue?

jpculp commented 1 year ago

Unfortunately not yet. We have to take a deeper look at the interaction between the host, containerd, and cAdvisor. Out of curiosity, do you see the same behavior with bottlerocket-aws-k8s-1.24?

jonathan-innis commented 1 year ago

I haven't tried the newer version on K8s 1.24 yet. Let me take a look on a newer K8s version and get back to you.

etungsten commented 1 year ago

Hi @jonathan-innis, although I haven't fully root-caused the issue, I wanted to share an update with some information.

I took a deeper look into this, and the issue seems to stem from kubelet either not refreshing the node status or publishing the wrong filesystem stats to the K8s API.

When kubelet first starts up, cAdvisor hasn't fully initialized by the time kubelet queries for filesystem stats, so kubelet reports "invalid capacity 0 on image filesystem". Apparently this is expected to happen sometimes, and eventually cAdvisor settles and starts reporting stats. When this happens, kubelet uses whatever stats it was able to scrounge up through its fallback path; the issue linked in that code block mentions that kubelet falls back to the CRI for filesystem information. The problem is that those initial, partial filesystem stats under-report available capacity, as you've noticed, and kubelet then doesn't attempt to update the K8s API with the correct filesystem stats even after cAdvisor is up and running.
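One way to confirm this race (a sketch, assuming host access via the admin container and sudo sheltie) is to look for the fallback message in the kubelet journal shortly after boot:

journalctl -u kubelet.service | grep -i "invalid capacity 0 on image filesystem"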

After the node becomes ready, if I query the metrics endpoints, both the cAdvisor stats and the node summary stats report correctly:

cAdvisor:

...
container_fs_limit_bytes{container="",device="/dev/nvme1n1p1",id="/",image="",name="",namespace="",pod=""} 4.756159012864e+12 1675367414546

Node summary:

...
  "fs": {
   "time": "2023-02-02T19:50:04Z",
   "availableBytes": 4561598828544,
   "capacityBytes": 4756159012864,
   "usedBytes": 1264123904,
   "inodesFree": 294302950,
   "inodes": 294336000,
   "inodesUsed": 33050
  },
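For anyone who wants to check the same data, these endpoints can also be reached through the API server proxy; this is a sketch using the node name from this example:

kubectl get --raw "/api/v1/nodes/ip-192-168-92-51.us-west-2.compute.internal/proxy/metrics/cadvisor" | grep container_fs_limit_bytes
kubectl get --raw "/api/v1/nodes/ip-192-168-92-51.us-west-2.compute.internal/proxy/stats/summary" | head -n 40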

But for some reason, the node object in the cluster does not reflect that in the K8s API: kubectl describe node

  Hostname:     ip-192-168-92-51.us-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           1073420188Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16073648Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3920m
  ephemeral-storage:           988190301799
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15056816Ki
  pods:                        58

The node still only reports ~988 GB of allocatable ephemeral-storage.

What's interesting is that once you either reboot the worker node or restart the kubelet service, the stats sync up correctly. After rebooting, kubectl describe node shows:

  Hostname:     ip-192-168-92-51.us-west-2.compute.internal
Capacity:                                                         
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           4644686536Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16073648Ki
  pods:                        58 
Allocatable:
  attachable-volumes-aws-ebs:  25 
  cpu:                         3920m
  ephemeral-storage:           4279469362667
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15056816Ki
  pods:                        58 

ephemeral-storage correctly reports the 4.2 TB available.

So it seems like kubelet is not updating the node's filesystem stats in the K8s API as often as it should. I can't yet explain why this happens on Bottlerocket but doesn't reproduce on AL2. I suspect kubelet's fallback to querying the CRI has something to do with it, i.e. cri/containerd vs. dockershim (https://github.com/kubernetes/kubernetes/pull/51152).

If you want to work around this issue, you can reboot the node or restart kubelet so that kubelet starts reporting the correct ephemeral-storage capacity. In the meantime, I'll spend more time digging into kubelet/containerd.
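Concretely, on a Bottlerocket node the kubelet restart looks roughly like this (a sketch, assuming the admin container is enabled):

apiclient exec admin bash    # from the control container; or SSH into the admin container
sudo sheltie                 # drop into a root shell on the host
systemctl restart kubelet.service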

stmcginnis commented 1 year ago

Are you still seeing this behavior? If so, do the stats eventually correct themselves, or does the node keep reporting the wrong size indefinitely once it's in this state?

There's a 10-second cache timeout for stats, so I wonder if we're hitting a case where the cached data needs to be invalidated before kubelet actually checks again and picks up the full storage capacity.

tweeks-reify commented 8 months ago

We ran into a similar issue on 1.28, resulting in pods being unschedulable due to insufficient storage. I tried rebooting, but that didn't seem to work.

In our case we have a second EBS volume (1 TB) attached, and it seems like it's not being picked up at all.

I hadn't realized I needed to specify the device as /dev/xvdb (/dev/xvda works on the Amazon Linux AMI); it works fine once updated to that.

James-Quigley commented 8 months ago

Still seeing this behavior on EKS 1.25. Entering the admin container > sudo sheltie > systemctl restart kubelet.service causes the node to start reporting the correct value for ephemeral storage.

James-Quigley commented 7 months ago

FWIW, recently upgraded to 1.26, and the behavior is there as well

ginglis13 commented 5 months ago

Hi @James-Quigley @jonathan-innis, I suspect this issue might be addressed by the change that has kubelet monitor the container runtime cgroup (https://github.com/bottlerocket-os/bottlerocket/pull/3804). Are you still seeing this issue on Bottlerocket versions >= 1.19.5?
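If it helps with checking, the running Bottlerocket version is visible either from the cluster or from the host; a quick sketch:

kubectl get nodes -o wide    # the OS-IMAGE column shows the Bottlerocket OS version
apiclient get os             # from a root shell on the host (admin container, then sudo sheltie)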