Open duxing opened 1 month ago
@orsenthil can you advise whether the lock contention is avoidable and what the possible actionable solutions are?
@duxing VPC CNI validates API server connectivity as part of its bootup process, and this check requires kube-proxy to set up the required iptables rules for the `kubernetes` Service (which represents the API server endpoints). That can take longer when the total number of Service objects in the cluster is north of 1000, which explains the CNI pod's health check failures and restarts. That being said, this is limited to node startup, when the CNI and kube-proxy pods are coming up in parallel, and it shouldn't be an issue on an active node. VPC CNI will not set up any iptables rules for individual pods.
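To get a rough sense of how large the kube-proxy-programmed ruleset is on an affected node, something along these lines can be run on the node itself (a diagnostic sketch; `KUBE-SERVICES`/`KUBE-SVC-*`/`KUBE-SEP-*` are the chains kube-proxy maintains in iptables mode):

```sh
# Count kube-proxy service/endpoint rules in the nat table (run on the node as root).
sudo iptables-save -t nat | grep -c -E '^-A KUBE-(SERVICES|SVC|SEP)'

# Time a full dump of the nat table; multi-second times here line up with slow rule updates.
time sudo iptables-save -t nat > /dev/null
```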
Workarounds:

- Bypass the dependency on `kube-proxy` by passing the EKS control plane's NLB endpoint via the [CLUSTER_ENDPOINT](https://github.com/aws/amazon-vpc-cni-k8s?tab=readme-ov-file#cluster_endpoint-v1121) env variable; that will address the CNI restarts observed during node bootup (example below). However, the two can still potentially run into iptables contention during the initial initialization phase, which can lead to longer node initialization time.
- Run `kube-proxy` in IPVS mode.
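For the first workaround, one way to set the variable is directly on the `aws-node` daemonset (a sketch; the endpoint URL is a placeholder for your cluster's API server endpoint, and on managed add-ons the same value can also be supplied through the add-on/Helm configuration):

```sh
# Point VPC CNI's startup connectivity check straight at the API server endpoint,
# so it no longer depends on kube-proxy having programmed the kubernetes Service rules.
kubectl -n kube-system set env daemonset/aws-node \
  CLUSTER_ENDPOINT=https://<your-eks-api-server-endpoint>
```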
@duxing - I missed the ping. As @achevuru mentioned, the contention, if observed, happens only during `aws-node` (VPC CNI) pod startup and not while the pods are running.
Kube-proxy provides flags such as `--iptables-min-sync-period` and `--iptables-sync-period` (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/) that control how often kube-proxy syncs its iptables rules, and tweaking those values can help here too.
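For illustration only (arbitrary example values, not a recommendation), those flags look like this on the kube-proxy command line; on EKS they are typically managed through the kube-proxy ConfigMap or the managed add-on configuration rather than edited directly:

```sh
# Longer sync periods mean kube-proxy rebuilds its iptables rules less often,
# so it competes for the xtables lock less frequently (at the cost of slower Service updates).
kube-proxy \
  --iptables-min-sync-period=10s \
  --iptables-sync-period=60s
```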
hi @achevuru!
Thanks for the suggestions! I did some research after submitting this issue and planned to try `ipvs` mode as well (TBD next week).
I wasn't aware of the 1st workaround; I'll give it a try.
Meanwhile, one thing that can be improved (IMO) in VPC CNI is logging/metrics.
If we had more logs (maybe DEBUG logs) related to waiting on `iptables` updates (with the duration as part of the log message), plus metrics for iptables update duration, pinpointing where the issue is would be significantly easier.
When this issue happened, the logs from VPC CNI had 0 warning logs and 0 error logs (everything was info or debug). It wasn't until a few days later, while desperately checking other logs from the log collection tool, that I realized `kube-proxy` was complaining about lock contention around the same time `vpc-cni` was running `iptables` operations.
Do you think it's reasonable for `vpc-cni` to be more verbose/transparent about blocking operations like this?
hi @orsenthil!

> the contention, if observed, happens only during `aws-node` (VPC CNI) pod startup and not while the pods are running.

That's absolutely right. I noticed this as well: when this issue happened on new nodes, existing nodes were perfectly fine, even if they needed to assign new EIPs.
I'll try to see if the EKS optimized kube-proxy addon lets me specify those values via add-on configuration (see the snippet below for one way to check). After doing more research, I think testing IPVS would be a better fix for the issue I ran into; I'll turn to these options if I have to stay with `iptables` mode.
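One way to check whether the managed kube-proxy add-on exposes those settings is to dump its configuration schema (a sketch assuming the AWS CLI is available; the add-on version matches the one from this issue):

```sh
# Print the configuration schema accepted by the EKS-managed kube-proxy add-on.
aws eks describe-addon-configuration \
  --addon-name kube-proxy \
  --addon-version v1.29.3-eksbuild.2 \
  --query configurationSchema --output text
```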
thanks again for helping! @orsenthil @achevuru
@duxing VPC CNI logs should show a relevant error message if it runs into an iptables contention issue. I believe in your case VPC CNI had the lock and `kube-proxy` was running into it, so you saw those error messages in the `kube-proxy` logs. But we can enhance the IPAMD logs where it waits for the API server connectivity test to succeed. We will track that.
thx for confirming!
In case another entity acquired the lock first, `vpc-cni` would throw a warning log, right?
What about adding a debug log for iptables update duration? This value can be calculated from multiple consecutive logs from the same instance, but that isn't easy to query.
If we have a single log entry, this duration can be queried easily and graphed to capture issues and history, e.g.:

`{"level":"debug","ts":"<timestamp>","caller":"networkutils/network.go:<line>","msg":"done execute iptable rule", "duration_ms": "3076", "rule": "xxx"}`
What happened:
- on an EKS cluster with many `Service`s (1000 in my case) and many pods (300 pods), the large iptables ruleset leads to long execution times for some `iptables` rules (5+ seconds)
- this leads to xtables lock contention between `kube-proxy` and `vpc-cni`, despite specifying `-w`
- this race condition between `kube-proxy` and `vpc-cni` has led to longer initialization times for `vpc-cni` and frequent pod crashes due to failing readiness checks (60s delay + 3 * 10s interval). Related issue #2945
Using some of the logs from `eks_i-0019a68d504566810_2024-06-06_1830-UTC_0.7.6.tar.gz` to walk through this issue (uploaded, see the "Attach logs" section). From `ipamd.log` I can tell the pod had been restarted 5 times by the time I collected the logs; the following parts of the logs overlap with the `kube-proxy` logs around the same time, leading to the contention.

from the `kube-proxy` log. CONSECUTIVE DEBUG logs, at 2024-06-06T16:49:46:

from `ipamd.log`. CONSECUTIVE DEBUG logs, between 2024-06-06T16:49:41 and 2024-06-06T16:49:49:

Attach logs
I've got logs from running the CNI log collection tool on 3 different instances that ran into this issue:

- `eks_i-0130dc8295b19b0e3_2024-06-06_1901-UTC_0.7.6.tar.gz` and `eks_i-0019a68d504566810_2024-06-06_1830-UTC_0.7.6.tar.gz` have been uploaded via `file="<filename>"; curl -s https://d1mg6achc83nsz.cloudfront.net/ebf57d09395e2150ac2485091ba7c48aa46181dbdcae78620987d3d7d36ace4b/us-east-1/$file | bash`
- `eks_i-02c1cd4484684230c_2024-06-05_1932-UTC_0.7.6.tar.gz` has been emailed.

What you expected to happen:
- `kube-proxy` is supposed to actually wait for `5s` rather than saying `5s` but only waiting `0.00001s`. If this is not expected, this is a problem with the `kube-proxy` addon from EKS. I recently upgraded `kube-proxy` from `v1.29.1-eksbuild.2` to `v1.29.3-eksbuild.2` and noticed this issue; maybe it existed before as well.
- `kube-proxy` may need to update `iptables` throughout its entire lifecycle, so this contention may not be entirely avoidable. I'd love to know if it's feasible to tell `vpc-cni` to wait only for the part of `iptables` that's necessary for its own initialization.
- when `vpc-cni` runs into lock contention, it should spit out some logs about the situation as well as what it's going to do, e.g. "Another app is currently holding the xtables lock; wait for X seconds" to the `ipamd` `DEBUG` logger.

How to reproduce it (as minimally and precisely as possible):
- EKS@1.29
- `ami-0a5010afd9acfaa26` / `amazon-eks-node-1.29-v20240227`
- `r5.4xlarge` (EKS managed nodegroup)
- `kube-proxy`: `v1.29.3-eksbuild.2`
- `vpc-cni`: `v1.18.1-eksbuild.3`
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): `v1.29.4-eks-036c24b`
- CNI Version: `v1.18.1-eksbuild.3`
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):