jonathan-innis opened this issue 1 year ago
@jonathan-innis, thanks for reaching out! We're taking a deeper look into this.
@jpculp Any progress or updates on this issue?
Unfortunately not yet. We have to take a deeper look at the interaction between the host, containerd, and cAdvisor. Out of curiosity, do you see the same behavior with bottlerocket-aws-k8s-1.24?
I haven't looked at this on K8s 1.24 yet. Let me try a newer version of K8s and get back to you on that.
Hi @jonathan-innis, although I haven't fully root-caused the issue, I wanted to provide an update with some information.
I took a deeper look into this, and it seems like the issue stems from kubelet not refreshing the node status, or publishing the wrong filesystem stats to the K8s API. When kubelet first starts up, cAdvisor hasn't fully initialized by the time kubelet queries for filesystem stats, so kubelet reports "invalid capacity 0 on image filesystem". Apparently this is expected to happen sometimes, and eventually cAdvisor settles and starts reporting stats. When this happens, kubelet uses whatever stats it was able to gather through a fallback path; the linked issue in that code block mentions that kubelet goes to the CRI for filesystem information. The problem is that those initial, partial filesystem stats under-report available capacity, as you've noticed, and kubelet then doesn't attempt to update the K8s API with the correct filesystem stats even after cAdvisor is up and running.
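If it helps anyone confirm the startup race on an affected node, here's a rough way to look for that message in the kubelet logs (a sketch, assuming you can reach a root shell on the host through the admin container):

# From the Bottlerocket admin container, drop into a root shell on the host
sudo sheltie
# Look for the startup message mentioned above in the kubelet logs
journalctl -u kubelet.service | grep -i "invalid capacity 0 on image filesystem"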
After the node becomes ready, if I query the metrics endpoint, both cadvisor stats and node summary stats are reporting correctly:
cAdvisor:
...
container_fs_limit_bytes{container="",device="/dev/nvme1n1p1",id="/",image="",name="",namespace="",pod=""} 4.756159012864e+12 1675367414546
Node summary:
...
"fs": {
"time": "2023-02-02T19:50:04Z",
"availableBytes": 4561598828544,
"capacityBytes": 4756159012864,
"usedBytes": 1264123904,
"inodesFree": 294302950,
"inodes": 294336000,
"inodesUsed": 33050
},
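For reference, both of those endpoints can be pulled through the API server; this is roughly how they can be queried (substitute your node name, and the grep is only there to narrow the output):

NODE=ip-192-168-92-51.us-west-2.compute.internal   # substitute your node name
# cAdvisor metrics served by kubelet
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" | grep container_fs_limit_bytes
# Node summary stats (the "fs" block shown above)
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/stats/summary"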
But for some reason, the node object in the cluster does not reflect that in the K8s API:
kubectl describe node
Hostname: ip-192-168-92-51.us-west-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 4
ephemeral-storage: 1073420188Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16073648Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 3920m
ephemeral-storage: 988190301799
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15056816Ki
pods: 58
Only ~988 GB of ephemeral-storage is reported as allocatable.
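A quick way to pull just those two numbers from the node object, in case anyone wants to compare (a sketch using jsonpath):

kubectl get node ip-192-168-92-51.us-west-2.compute.internal \
  -o jsonpath='{.status.capacity.ephemeral-storage}{"\n"}{.status.allocatable.ephemeral-storage}{"\n"}'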
What's interesting is that once you either reboot the worker node or restart the kubelet service, the stats sync up correctly:
After rebooting, kubectl describe node:
Hostname: ip-192-168-92-51.us-west-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 4
ephemeral-storage: 4644686536Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16073648Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 3920m
ephemeral-storage: 4279469362667
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15056816Ki
pods: 58
ephemeral-storage correctly reports the 4.2 TB available.
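As a unit sanity check (assuming Ki here means KiB, i.e. 1024 bytes), the post-reboot capacity lines up exactly with the cAdvisor container_fs_limit_bytes value quoted earlier:

echo $(( 4644686536 * 1024 ))   # 4756159012864 bytes (~4.3 TiB), matching container_fs_limit_bytes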
So it seems like kubelet is not updating node filesystem stats to the K8s API as frequently as it should. I currently can't explain why this is happening on Bottlerocket but not replicating on AL2. I suspect the kubelet fallback to querying the CRI has something to do with it, i.e. cri/containerd vs dockershim (https://github.com/kubernetes/kubernetes/pull/51152).
If you want to work around this issue, you can reboot the nodes or restart kubelet to get it reporting the correct ephemeral storage capacity. In the meantime, I'll spend more time digging into kubelet/containerd.
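Concretely, the restart workaround looks roughly like this on a Bottlerocket node (a sketch, run from the admin container; rebooting the instance works too):

# From the Bottlerocket admin container, get a root shell on the host
sudo sheltie
# Restart kubelet so it re-reports filesystem capacity to the API server
systemctl restart kubelet.service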
Wondering if you are still seeing this behavior. If so, do the stats eventually correct themselves, or once it's in this state does it keep reporting the wrong size indefinitely?
There's a 10 second cache timeout for stats, so I wonder if we are hitting a case where the data in the cache needs to be invalidated before it actually checks again and gets the full storage space.
We ran into a similar issue as well on 1.28, resulting in pods being unschedulable due to insufficient storage. I tried rebooting, but that didn't seem to work.
In our case we have a second EBS volume (1TB) we are using, and it seems like it's not being picked up at all.
I didn't realize I needed to specify the device as /dev/xvdb (/dev/xvda works on the Amazon Linux AMI); it works fine once updated to that.
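For anyone else hitting the same thing, here's a rough sketch of the relevant launch parameter; the AMI, instance type, and volume size below are placeholders, and only the DeviceName is the important part:

aws ec2 run-instances \
  --image-id <bottlerocket-ami-id> \
  --instance-type m5.xlarge \
  --block-device-mappings '[{"DeviceName":"/dev/xvdb","Ebs":{"VolumeSize":1000,"VolumeType":"gp3"}}]'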
Still seeing this behavior on EKS 1.25. Entering the admin container > sudo sheltie > systemctl restart kubelet.service causes it to start reporting the correct value for ephemeral storage.
FWIW, recently upgraded to 1.26, and the behavior is there as well
Hi @James-Quigley @jonathan-innis, I suspect this issue might be addressed by changes to include monitoring of the container runtime cgroup by kubelet https://github.com/bottlerocket-os/bottlerocket/pull/3804. Are you still seeing this issue on versions of Bottlerocket >= 1.19.5?
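If it helps anyone verify, one quick way to check which Bottlerocket version each node is running (a sketch using the node's osImage field):

kubectl get nodes -o custom-columns=NAME:.metadata.name,OS-IMAGE:.status.nodeInfo.osImage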
Image I'm using:
AMI Name:
bottlerocket-aws-k8s-1.22-aarch64-v1.11.1-104f8e0f
What I expected to happen:
I expected the capacity reported on my worker node to be approximately the actual size of the EBS volume backing the xvdb mount for my node filesystem.
What actually happened:
cAdvisor or something in the BR image appears to be under-reporting the amount of capacity that I have on this worker node. The ephemeral-storage capacity here is approximately 1457383148Ki ~= 1.35 Ti, which is not close to the ~4.3T that lsblk is reporting.
How to reproduce the problem:
Launch a Bottlerocket node with a large EBS volume backing the /dev/xvdb mount
Check the reported ephemeral-storage with kubectl get node or kubectl describe node (a rough sketch follows below)
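Roughly, the mismatch can be seen by comparing the block device size on the host against what the node reports (a sketch; lsblk needs a root shell via the admin container):

# On the host (admin container > sudo sheltie): actual block device sizes
lsblk
# From anywhere with cluster access: what kubelet reported to the API server
kubectl describe node | grep ephemeral-storage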