Closed aballman closed 8 months ago
Great debugging, this looks like a bug in the logic for rapid scaling: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L660
The code seeks to assign IPs until the desired state can be met, but if no more ENIs can be attached to the node, this for loop will spin indefinitely...
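The failure mode described above can be sketched in Go. This is a hypothetical, simplified model — `reconcile`, its parameters, and the capacity numbers are illustrative stand-ins, not the real ipamd identifiers — showing why the loop needs an exit path once the node can neither attach another ENI nor allocate another IP:

```go
package main

import "fmt"

// reconcile models the IP allocation loop: it tries to satisfy a
// WARM_IP_TARGET and returns how many iterations it ran plus the final
// allocated-IP count. All names here are illustrative, not ipamd's.
func reconcile(maxENIs, ipsPerENI, warmIPTarget, usedIPs int) (iterations, allocated int) {
	attachedENIs := maxENIs              // node already has every ENI attached
	allocated = attachedENIs * ipsPerENI // and every IP on them allocated

	// short reports how many warm IPs we are still missing.
	short := func() int {
		free := allocated - usedIPs
		if free >= warmIPTarget {
			return 0
		}
		return warmIPTarget - free
	}

	for short() > 0 {
		iterations++
		switch {
		case allocated < attachedENIs*ipsPerENI:
			allocated++ // room on an existing ENI: assign one more IP
		case attachedENIs < maxENIs:
			attachedENIs++ // room for another ENI: attach it
		default:
			// Node is at capacity but the warm target is still unmet.
			// Without this exit the loop spins forever, which matches
			// the 100% CPU reported in this issue.
			return iterations, allocated
		}
	}
	return iterations, allocated
}

func main() {
	// t3.medium-like node: 3 ENIs x 6 IPs, WARM_IP_TARGET=10, 12 IPs in use.
	it, alloc := reconcile(3, 6, 10, 12)
	fmt.Println("iterations:", it, "allocated:", alloc) // iterations: 1 allocated: 18
}
```

With the `default` branch removed, `short()` would return 4 on every pass and the loop would busy-spin indefinitely.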
This is an unfortunate miss. While we fix this, I recommend downgrading to v1.16.2. Can you also let us know the following:
Thanks for the quick response!
WARM_ENI_TARGET: 0
WARM_IP_TARGET: 10
- No additional ENI attached
- No custom networking or security groups for pods
- Network Policy Agent is enabled
@aballman I did a bit more digging and have a fix to prevent the livelock, but what is curious is why the datastore is still returning "too low" in your case. When all ENIs are attached and all IPs are assigned to those ENIs (including warm IPs), then max pods should be hit and the datastore should stop trying to allocate IPs. It should actually happen sooner than max, since some pods run in host networking namespace.
For `t3.medium`, after 18 IPs are allocated (3 ENIs * 6 IPs per ENI), the datastore should never return "too low", as max pods should be set to 18. Can you check what `--max-pods` is set to on the node? One way to do this is to just run `systemctl status kubelet -l` on the instance and see the argument passed to kubelet.
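As a reference point, the expected limit comes from simple arithmetic. A sketch of the standard EKS AMI max-pods formula (one primary IP per ENI is reserved for the ENI itself, and the +2 covers pods in the host network namespace such as kube-proxy and aws-node); note it yields 17 for a `t3.medium`, which matches the "max pods is set to 17 as expected" figure mentioned later in the thread:

```go
package main

import "fmt"

// maxPods applies the EKS AMI formula: each ENI's primary address cannot
// be handed to a pod, and +2 accounts for host-networking pods
// (kube-proxy and aws-node) that do not consume a secondary IP.
func maxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

func main() {
	// t3.medium: 3 ENIs, 6 IPv4 addresses per ENI.
	fmt.Println(maxPods(3, 6)) // prints 17
}
```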
My clusters are set up with Karpenter and have a lot of autoscaling happening. I don't have any medium instances at the moment. I'll see if I can get one launched and report back.
The `KubeletConfiguration` file specifies `maxPods: 35` on `ami-0f623fc60e44e28fc` / `v1.29.0-eks-a5ec690` for a `t3.medium`.
> The `KubeletConfiguration` file specifies `maxPods: 35` on `ami-0f623fc60e44e28fc` / `v1.29.0-eks-a5ec690` for a `t3.medium`.

Well, that is definitely wrong. An instance cannot support more pods than it has IPs available to assign to them. We should track that issue separately, where there are two options:
- You file an AWS support case in your console, which will get the EKS nodes team involved
- You create an issue at https://github.com/awslabs/amazon-eks-ami/issues, which will also get the EKS nodes team involved, but likely more slowly
I'm using Bottlerocket; would that be the same repo to report the issue to? Thanks for your help here.
I'd like to report that we are also seeing this on v1.16.3. Right after an upgrade to EKS 1.26 and AWS VPC CNI 1.16.3, it's a night-and-day difference, with the exact same error. The arrow notes the moment we upgraded.
The VPC CNI pods were restarted and it seems the CPU usage "normalized" by itself, I can't explain what changed aside from the pods no longer being in CrashLoopBackOff:
> kc get pods -l k8s-app=aws-node
NAME READY STATUS RESTARTS AGE
aws-node-2tj8f 2/2 Running 0 169m
aws-node-n42kk 2/2 Running 1 (87m ago) 16h
aws-node-v2ggb 1/2 CrashLoopBackOff 31 (115s ago) 3h53m
> kc logs aws-node-v2ggb
Defaulted container "aws-node" out of: aws-node, aws-eks-nodeagent, aws-vpc-cni-init (init)
Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-02-27T19:40:05Z" level=info msg="Starting IPAM daemon... "
time="2024-02-27T19:40:05Z" level=info msg="Checking for IPAM connectivity... "
time="2024-02-27T19:40:08Z" level=info msg="Copying config file... "
time="2024-02-27T19:40:08Z" level=info msg="Successfully copied CNI plugin binary and config file."
time="2024-02-27T19:42:17Z" level=error msg="Failed to wait for IPAM daemon to complete" error="signal: killed"
> kc delete pod aws-node-v2ggb
pod "aws-node-v2ggb" deleted
> kc top pod -l k8s-app=aws-node
NAME CPU(cores) MEMORY(bytes)
aws-node-2tj8f 4m 64Mi
aws-node-n42kk 1203m 48Mi
aws-node-nvqg7 1m 15Mi
This leads me to conclude the issue is somewhat intermittent. I'm more inclined to believe that some pods were moved to different nodes, which reduced the number of IPs needed and resolved the issue above.
We use the default EKS AMI and a bog-standard installation of the VPC CNI too; our only configuration is related to resources:
{"resources":{"limits":{"memory":"128Mi"},"requests":{"cpu":"5m","memory":"50Mi"}}}
We'll attempt the downgrade and report back further.
@aballman ah good catch, so I am not sure if https://github.com/bottlerocket-os/bottlerocket/issues is the right repo to report this issue or not. Would you mind filing an AWS support case? That will definitely get it routed to the right team, albeit a bit more slowly than the convenience of GitHub Issues.
@Angelin01 which instance type are you seeing this on? And are you using Bottlerocket as well?
As a side note, https://github.com/aws/amazon-vpc-cni-k8s/pull/2810 should merge soon and be released in v1.16.4, which is targeting mid-March.
@jdn5126 We use SpotInst Ocean as a means of using Spot Instances, but underneath this is a standard EC2 instance with the standard EKS AMI (ami-0d00c9f7217e4079c). But our instances are mixed, large:

We do not use Bottlerocket.

I also talked to another team that just created a test EKS cluster today on 1.28, not using SpotInst Ocean and instead using AWS Node groups: same issue.
@Angelin01 Got it, thanks for confirming. Seems there are more edge cases than just max pods, so we'll have to push to get v1.16.4 out sooner.
For reference, these nodes have the following number of pods, respectively:
Two of these are VERY close to the limit here, so it may indeed be related to max pods.
But yes, I'd still verify other cases, as I said a brand new cluster experienced the same issue. I don't know if you have a "yank" mechanism, but I'd pull the 1.16.3 release.
Confirmed downgrade to v1.16.2 fixes the issue:
> kc get pods -l k8s-app=aws-node
NAME READY STATUS RESTARTS AGE
aws-node-5szp6 2/2 Running 0 7m54s
aws-node-d5cbs 2/2 Running 0 7m19s
aws-node-vlqw5 2/2 Running 0 7m45s
> kc top pod -l k8s-app=aws-node
NAME CPU(cores) MEMORY(bytes)
aws-node-5szp6 5m 59Mi
aws-node-d5cbs 4m 57Mi
aws-node-vlqw5 4m 43Mi
@jdn5126 I appreciate the fast responses, I hope the fix doesn't prove to be too problematic. I also apologize for the comment spam.
@Angelin01 happy to help, this is an unfortunate regression that we need to make sure we have coverage for in the future. We do not have a "yank" mechanism, just documentation to recommend against updating or updating with caution. And then hopefully a quick turnaround in releasing a patched version.
@aballman I spoke briefly to the EKS Nodes team, and it looks like the incorrect max pods is a Bottlerocket issue, since they set the `KubeletConfiguration`. So we can file an issue at https://github.com/bottlerocket-os/bottlerocket/issues. When you spin up a `t3.medium` with the EKS optimized AMI, max pods is set to 17 as expected.
@jdn5126 Thanks, the follow up is much appreciated. I'll file an issue there.
We have just bumped into this issue. We run a fleet of 100+ EKS clusters and we upgraded the CNI plugin to the latest version in about a dozen of them. Initially we didn't see any issue, but after a couple of days we noticed that our logging ingestion (and cost) had gone through the roof (we have centralized logging for `/var/log/aws-routed-eni/ipamd.log`) and the `aws-node` pod in these clusters was using WAY too much CPU.
We released a hot-patch to pin the version of the AWS CNI plugin to v1.16.2 across the whole fleet to prevent accidental upgrades of the plugin's version. But we would prefer not to pin the versions of the AWS-provided add-ons, as they end up not being updated and, over time, rot.
We run the AWS provided EKS AMI, not bottlerocket.
> We do not have a "yank" mechanism, just documentation to recommend against updating or updating with caution. And then hopefully a quick turnaround in releasing a patched version.
For those, like us, using the AWS EKS Add-ons to manage this sort of component, this feels like a sub-par solution. I think that operators who hand over the responsibility of managing these components to AWS expect a quick fix process for identified issues like this one.
If a yank mechanism is not available, an alternative would be a new release with a patch for the bug. Could that be a possibility?
> As a side note, https://github.com/aws/amazon-vpc-cni-k8s/pull/2810 should merge soon and be released in v1.16.4, which is targeting mid-March.
Mid-March can be about two weeks away - there's still potential for v1.16.3 to bring some headaches to EKS operators.
@PerGon Yep, we fully understand and are working on a patch release as soon as possible, so it will be before March 15th. Pulling a release from the field is a bad idea for many reasons, so our primary mechanisms for alerting are EKS console notices and release notes: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
Understandable. Hence the suggestion:

> an alternative would be a new release with a patch for the bug
Releasing something like a v1.16.3.1 or a v1.16.4 that would only include the patch. The point being, doing a release so that v1.16.3 stops being the latest. I would guess that the latest version has a much higher probability of being used than the second-to-last.

But I understand a lot of other things could be in the way of doing that. I was just thinking out loud, trying to help reduce the chance that others face the same issue by upgrading to the latest version. But I've seen the PR with the fix has been merged, fingers crossed a new release will come out soon 🤞
> But I've seen the PR with the fix has been merged, fingers crossed a new release will come out soon 🤞

Yes, we are targeting a quick release.
Can confirm rolling back to v1.16.2-eksbuild.1 solved the issue for us. We were seeing CPU usage of around ~1200 milliCPU on v1.16.3-eksbuild.1, compared to ~3 milliCPU on v1.16.2-eksbuild.1.
For those using terraform-eks-module, the following configuration will pin the VPC-CNI EKS add-on to v1.16.2-eksbuild.1:
cluster_addons = {
vpc-cni = {
# https://github.com/aws/amazon-vpc-cni-k8s/issues/2807
addon_version = "v1.16.2-eksbuild.1"
# most_recent = true
}
}
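Before pinning, it can help to confirm which version a cluster is actually running. A small sketch, assuming the usual aws-node image tag format; the `image` string below is a stand-in for what `kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.template.spec.containers[0].image}'` would print on an affected cluster:

```shell
# Stand-in for the image string kubectl would return on an affected cluster.
image="602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.16.3"

# The CNI version is everything after the last colon in the image reference.
version="${image##*:}"
echo "$version"   # v1.16.3

# Flag the version this issue tracks.
if [ "$version" = "v1.16.3" ]; then
  echo "affected: pin to v1.16.2-eksbuild.1 until the fixed release ships"
fi
```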
We have this problem on 1.29 with m5.large instances (ami-0a5010afd9acfaa26). Downgrading to "v1.16.2-eksbuild.1" helped instantly.
VPC CNI v1.16.4, which contains the fix for this issue, is now released: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.4

The EKS Managed Addon should be available in all regions within the next 48 hours.

I am going to close this issue now that v1.16.4 is available, but please feel free to add any comments.
> I am going to close this issue now that v1.16.4 is available, but please feel free to add any comments.
We are not seeing v1.16.4 available in the EKS Add-on versions. Do you know when this is expected to ship to the EKS Add-on?
@jeremyruffell it takes ~24-48 hours after the GitHub release for the EKS Managed Addon to be available in all regions, with us-east-1 being the last region.
What happened:

I updated to v1.16.3. Specifically on medium instance sizes, CPU usage by the aws-node pod is maxed out.

On a t3.medium (I have also seen this happen on a c6g.medium)

On an m7i.2xlarge

Attach logs

This log is spewing constantly in /var/log/aws-routed-eni/ipamd.log on the medium; other instance sizes show a typical volume of logs.

What you expected to happen: Typical CPU usage for medium instances

How to reproduce it (as minimally and precisely as possible): Spin up a medium size in an EKS cluster with the v1.16.3 CNI. In my case it's a t3.medium.

Anything else we need to know?:

Environment:
- Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
- CNI Version: v1.16.3
- OS (e.g: cat /etc/os-release): Bottlerocket OS 1.19.1 (aws-k8s-1.29)
- Kernel (e.g. uname -a): 6.1.72