aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS

iptables contention between vpc-cni and kube-proxy #2948

Open duxing opened 1 month ago

duxing commented 1 month ago

What happened:

On an EKS cluster with many Services (1000 in my case) and many pods (300 pods), the large iptables ruleset leads to long execution times for some iptables operations (5+ seconds).

This leads to xtables lock contention between kube-proxy and vpc-cni, despite both specifying -w:

# from kube-proxy logs
2024-06-06T16:55:52.525393865Z stderr F Another app is currently holding the xtables lock. Stopped waiting after 5s.

This race condition between kube-proxy and vpc-cni has led to longer initialization times for vpc-cni and frequent pod crashes due to failing readiness checks (60s delay + 3 * 10s interval). Related issue #2945

Using some of the logs from eks_i-0019a68d504566810_2024-06-06_1830-UTC_0.7.6.tar.gz to walk through this issue (uploaded, see the "Attach logs" section): from ipamd.log I can tell the pod had been restarted 5 times by the time I collected the logs, and the following part of the logs overlaps with the kube-proxy logs around the same time, leading to the contention.

From the kube-proxy log, consecutive entries at 2024-06-06T16:49:46:

...# the previous log entry is 3.5s ago
2024-06-06T16:49:46.155486643Z stderr F I0606 16:49:46.155398 1 trace.go:236] Trace[1591999613]: "iptables restore" (06-Jun-2024 16:49:41.141) (total time: 5013ms):
2024-06-06T16:49:46.155504715Z stderr F Trace[1591999613]: [5.013803896s] [5.013803896s] END
2024-06-06T16:49:46.1555071Z stderr F E0606 16:49:46.155435 1 proxier.go:1525] "Failed to execute iptables-restore" err=<
2024-06-06T16:49:46.155509335Z stderr F exit status 4: Another app is currently holding the xtables lock; still 4s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.1555116Z stderr F Another app is currently holding the xtables lock; still 3s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.155513512Z stderr F Another app is currently holding the xtables lock; still 2s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.155515504Z stderr F Another app is currently holding the xtables lock; still 1s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.15551722Z stderr F Another app is currently holding the xtables lock; still 0s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.155519156Z stderr F Another app is currently holding the xtables lock. Stopped waiting after 5s.
2024-06-06T16:49:46.15552093Z stderr F >
2024-06-06T16:49:46.155522876Z stderr F I0606 16:49:46.155451 1 proxier.go:803] "Sync failed" retryingTime="30s"
2024-06-06T16:49:46.155529192Z stderr F I0606 16:49:46.155463 1 proxier.go:792] "SyncProxyRules complete" elapsed="7.619290979s"
...# the next log entry is 0.7s after

From ipamd.log, consecutive debug entries between 2024-06-06T16:49:41 and 2024-06-06T16:49:49:

{"level":"debug","ts":"2024-06-06T16:49:41.142Z","caller":"networkutils/network.go:449","msg":"execute iptable rule : first SNAT rules for non-VPC outbound traffic"} {"level":"debug","ts":"2024-06-06T16:49:43.691Z","caller":"networkutils/network.go:449","msg":"rule nat/POSTROUTING rule first SNAT rules for non-VPC outbound traffic shouldExist true rule [-m comment --comment AWS SNAT CHAIN -j AWS-SNAT-CHAIN-0] exists false, err "} {"level":"debug","ts":"2024-06-06T16:49:49.028Z","caller":"networkutils/network.go:449","msg":"execute iptable rule : AWS-SNAT-CHAIN-0"}

Attach logs

I've got logs from running the CNI log collection tool on 3 different instances that ran into this issue:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

duxing commented 1 month ago

@orsenthil can you advise whether the lock contention is avoidable, and what the possible actionable solutions are?

achevuru commented 1 month ago

@duxing VPC CNI validates API server connectivity as part of its bootup process. This check requires kube-proxy to set up the iptables rules for the kubernetes Service (which represents the API server endpoints), and that can take longer when the total number of Service objects in the cluster is north of 1000. That explains the CNI pod's health check failures and restarts. That being said, this is limited to node startup, when the CNI and kube-proxy pods are coming up in parallel, and shouldn't be an issue on an active node. VPC CNI will not set up any iptables rules for individual pods.
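For illustration only, here is roughly what that kind of startup gate looks like with client-go. This is a hedged sketch, not the actual ipamd code: the poll interval, timeout, and the ServerVersion() probe are assumptions.

```go
// Illustrative sketch (not the actual ipamd code): wait for the API server to
// become reachable through the in-cluster "kubernetes" Service, which only
// works once kube-proxy has programmed the corresponding iptables rules.
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // resolves to the kubernetes Service ClusterIP
	if err != nil {
		log.Fatalf("not running in a cluster: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to build client: %v", err)
	}

	start := time.Now()
	// Poll until a trivial API call succeeds; on a node with ~1000 Services this
	// can take a while, because kube-proxy has to finish a full iptables-restore
	// before the Service VIP becomes reachable.
	err = wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		if _, err := client.Discovery().ServerVersion(); err != nil {
			log.Printf("API server not reachable yet (waited %s): %v", time.Since(start), err)
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		log.Fatalf("gave up waiting for API server connectivity: %v", err)
	}
	log.Printf("API server reachable after %s", time.Since(start))
}
```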

Workarounds:

orsenthil commented 1 month ago

@duxing - I missed the ping. As @achevuru mentioned, the contention, if observed, happens only during aws-node (VPC CNI) pod startup and not while pods are running.

kube-proxy provides flags such as --iptables-min-sync-period and --iptables-sync-period (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/) that control how often kube-proxy syncs its iptables rules, and tweaking those values can help here too.

duxing commented 1 month ago

hi @achevuru !

Thanks for the suggestions! I did some research after submitting this issue and planned to try IPVS mode as well (TBD next week). I wasn't aware of the 1st workaround; I'll give it a try.

Meanwhile, one thing that could be improved (IMO) in VPC CNI is logging / metrics.

If we could have more logs (maybe DEBUG logs) around waiting on iptables updates (with the duration as part of the log message) and metrics for iptables update duration, pinpointing where the issue is would be significantly easier.
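For the metrics half of this suggestion, a sketch of what an iptables-update-duration histogram could look like with the Prometheus Go client; the metric name, labels, and buckets below are made up, not something ipamd exposes today:

```go
// Hypothetical metric (not part of ipamd today): a histogram of how long each
// iptables rule update takes, so slow updates and lock contention show up
// directly on a dashboard instead of being inferred from gaps between log lines.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var iptablesUpdateSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "awscni_iptables_rule_update_seconds", // made-up name
		Help:    "Time spent executing a single iptables rule update.",
		Buckets: prometheus.ExponentialBuckets(0.01, 2, 12), // 10ms .. ~20s
	},
	[]string{"table", "chain"},
)

func init() {
	prometheus.MustRegister(iptablesUpdateSeconds)
}

// observeRuleUpdate wraps an iptables operation and records its duration.
func observeRuleUpdate(table, chain string, fn func() error) error {
	start := time.Now()
	err := fn()
	iptablesUpdateSeconds.WithLabelValues(table, chain).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	// Usage example with a stand-in for a real iptables call.
	_ = observeRuleUpdate("nat", "POSTROUTING", func() error {
		time.Sleep(25 * time.Millisecond) // placeholder for the actual iptables operation
		return nil
	})
}
```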

When this issue happened, the logs from VPC CNI had 0 warning logs and 0 error logs (everything was info or debug). It wasn't until a few days later, while desperately checking other logs from the log collection tool, that I realized kube-proxy had been complaining about lock contention around the same time vpc-cni was running iptables operations.

Do you think it's reasonable for vpc-cni to be more verbose / transparent on blocking operations like this?

duxing commented 1 month ago

hi @orsenthil !

the contention, if observed, happens only during aws-node (VPC CNI) pod startup and not while pods are running.

That's absolutely right. I noticed this as well: when this issue happened to new nodes, existing nodes were perfectly fine, even when they needed to assign new EIPs. I'll try to see if the EKS optimized kube-proxy addon allows me to specify those values via add-on configuration. After doing more research, I think testing IPVS would be a better fix for the issue I ran into. I'll turn to these options if I have to stay with iptables mode.

thanks again for helping! @orsenthil @achevuru

achevuru commented 1 month ago

@duxing VPC CNI logs should show a relevant error message if it runs into an iptables contention issue. I believe in your case VPC CNI held the lock and kube-proxy was running into it, which is why you saw those error messages in the kube-proxy logs. But we can enhance the IPAMD logs where it waits for the API server connectivity test to succeed. We will track that.

duxing commented 1 month ago

Thanks for confirming!

If another entity acquires the lock first, vpc-cni would emit a warning log, right?

What about adding a debug log for the iptables update duration? This value can be calculated from multiple consecutive log entries on the same instance, but that isn't easy to query.

With a single log entry, this duration can be queried easily and graphed to capture issues and history, e.g.:

{"level":"debug","ts":"<timestamp>","caller":"networkutils/network.go:<line>","msg":"done execute iptable rule", "duration_ms": "3076", "rule": "xxx"}
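A sketch of code that would emit that kind of single entry, using zap directly (the ipamd log format looks zap-like, but how this would actually be wired into networkutils is omitted; the field names simply mirror the example above):

```go
// Sketch of the proposed single-entry duration log, using zap directly.
// The "rule" and "duration_ms" fields mirror the example entry above.
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	cfg := zap.NewProductionConfig()                 // JSON encoder: level, ts, caller, msg, fields
	cfg.Level = zap.NewAtomicLevelAt(zap.DebugLevel) // make sure debug entries are emitted
	logger, _ := cfg.Build()
	defer logger.Sync()

	// runRule times a single iptables operation and logs one entry with its duration.
	runRule := func(rule string, fn func() error) error {
		start := time.Now()
		err := fn()
		logger.Debug("done execute iptable rule",
			zap.String("rule", rule),
			zap.Int64("duration_ms", time.Since(start).Milliseconds()),
			zap.Error(err),
		)
		return err
	}

	// Usage example with a stand-in for the real iptables call.
	_ = runRule("first SNAT rules for non-VPC outbound traffic", func() error {
		time.Sleep(50 * time.Millisecond) // placeholder for the actual iptables operation
		return nil
	})
}
```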