aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS

iptables contention between vpc-cni and kube-proxy #2948

Open duxing opened 1 month ago

duxing commented 1 month ago

What happened:

On an EKS cluster with many Services (1000 in my case) and many pods (300 pods), the large iptables ruleset leads to long execution times for some iptables operations (5+ seconds).

This leads to xtables lock contention between kube-proxy and vpc-cni, despite both specifying -w:

# from kube-proxy logs
2024-06-06T16:55:52.525393865Z stderr F Another app is currently holding the xtables lock. Stopped waiting after 5s.

This race condition between kube-proxy and vpc-cni has led to longer initialization times for vpc-cni and frequent pod crashes due to failing readiness checks (60s delay + 3 * 10s interval). Related issue #2945

Using some of the logs from eks_i-0019a68d504566810_2024-06-06_1830-UTC_0.7.6.tar.gz to walk through this issue (uploaded, see the "Attach logs" section): from ipamd.log I can tell the pod had been restarted 5 times by the time I collected the logs, and the following part of the logs overlaps with the kube-proxy logs around the same time, leading to the contention.

From the kube-proxy log, consecutive entries at 2024-06-06T16:49:46:

...# the previous log entry is 3.5s ago
2024-06-06T16:49:46.155486643Z stderr F I0606 16:49:46.155398 1 trace.go:236] Trace[1591999613]: "iptables restore" (06-Jun-2024 16:49:41.141) (total time: 5013ms):
2024-06-06T16:49:46.155504715Z stderr F Trace[1591999613]: [5.013803896s] [5.013803896s] END
2024-06-06T16:49:46.1555071Z stderr F E0606 16:49:46.155435 1 proxier.go:1525] "Failed to execute iptables-restore" err=<
2024-06-06T16:49:46.155509335Z stderr F exit status 4: Another app is currently holding the xtables lock; still 4s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.1555116Z stderr F Another app is currently holding the xtables lock; still 3s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.155513512Z stderr F Another app is currently holding the xtables lock; still 2s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.155515504Z stderr F Another app is currently holding the xtables lock; still 1s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.15551722Z stderr F Another app is currently holding the xtables lock; still 0s 100000us time ahead to have a chance to grab the lock...
2024-06-06T16:49:46.155519156Z stderr F Another app is currently holding the xtables lock. Stopped waiting after 5s.
2024-06-06T16:49:46.15552093Z stderr F >
2024-06-06T16:49:46.155522876Z stderr F I0606 16:49:46.155451 1 proxier.go:803] "Sync failed" retryingTime="30s"
2024-06-06T16:49:46.155529192Z stderr F I0606 16:49:46.155463 1 proxier.go:792] "SyncProxyRules complete" elapsed="7.619290979s"
...# the next log entry is 0.7s after

From ipamd.log, consecutive debug entries between 2024-06-06T16:49:41 and 2024-06-06T16:49:49:

{"level":"debug","ts":"2024-06-06T16:49:41.142Z","caller":"networkutils/network.go:449","msg":"execute iptable rule : first SNAT rules for non-VPC outbound traffic"} {"level":"debug","ts":"2024-06-06T16:49:43.691Z","caller":"networkutils/network.go:449","msg":"rule nat/POSTROUTING rule first SNAT rules for non-VPC outbound traffic shouldExist true rule [-m comment --comment AWS SNAT CHAIN -j AWS-SNAT-CHAIN-0] exists false, err "} {"level":"debug","ts":"2024-06-06T16:49:49.028Z","caller":"networkutils/network.go:449","msg":"execute iptable rule : AWS-SNAT-CHAIN-0"}

Attach logs

I've got logs from running the CNI log collection tool on 3 different instances that ran into this issue:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

duxing commented 1 month ago

@orsenthil can you advise whether the lock contention is avoidable, and what the possible actionable solutions are?

achevuru commented 1 month ago

@duxing VPC CNI validates API server connectivity as part of its bootup process. This check requires kube-proxy to set up the iptables rules for the kubernetes Service (which represents the API server endpoints), and that can take longer when the total number of Service objects in the cluster is north of 1000. That explains the CNI pod's health check failures and restarts. That being said, this is limited to node startup, when the CNI and kube-proxy pods are coming up in parallel, and shouldn't be an issue on an active node. VPC CNI will not set up any iptables rules for individual pods.
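For illustration only, here is roughly what that kind of startup gate looks like with client-go. This is a hedged sketch, not the actual ipamd code: the poll interval, timeout, and the ServerVersion() probe are assumptions.

```go
// Illustrative sketch (not the actual ipamd code): wait for the API server to
// become reachable through the in-cluster "kubernetes" Service, which only
// works once kube-proxy has programmed the corresponding iptables rules.
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // resolves to the kubernetes Service ClusterIP
	if err != nil {
		log.Fatalf("not running in a cluster: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to build client: %v", err)
	}

	start := time.Now()
	// Poll until a trivial API call succeeds; on a node with ~1000 Services this
	// can take a while, because kube-proxy has to finish a full iptables-restore
	// before the Service VIP becomes reachable.
	err = wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		if _, err := client.Discovery().ServerVersion(); err != nil {
			log.Printf("API server not reachable yet (waited %s): %v", time.Since(start), err)
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		log.Fatalf("gave up waiting for API server connectivity: %v", err)
	}
	log.Printf("API server reachable after %s", time.Since(start))
}
```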

Workarounds:

orsenthil commented 1 month ago

@duxing - I missed the ping. As @achevuru mentioned, the contention, if observed, happens only during aws-node (VPC CNI) pod startup and not while pods are running.

kube-proxy provides flags such as --iptables-min-sync-period and --iptables-sync-period (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/) that control how often kube-proxy syncs its iptables rules, and tweaking those values can help here too.

duxing commented 1 month ago

hi @achevuru !

Thanks for the suggestions! I did some research after submitting this issue and planned to try IPVS mode as well (TBD next week). I wasn't aware of the 1st workaround; I'll give it a try.

Meanwhile, one thing that could be improved (IMO) in VPC CNI is logging / metrics.

If we could have more logs (maybe DEBUG logs) around waiting on iptables updates (with the duration as part of the log message) and metrics for iptables update duration, pinpointing where the issue is would be significantly easier.
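For the metrics half of this suggestion, a sketch of what an iptables-update-duration histogram could look like with the Prometheus Go client; the metric name, labels, and buckets below are made up, not something ipamd exposes today:

```go
// Hypothetical metric (not part of ipamd today): a histogram of how long each
// iptables rule update takes, so slow updates and lock contention show up
// directly on a dashboard instead of being inferred from gaps between log lines.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var iptablesUpdateSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "awscni_iptables_rule_update_seconds", // made-up name
		Help:    "Time spent executing a single iptables rule update.",
		Buckets: prometheus.ExponentialBuckets(0.01, 2, 12), // 10ms .. ~20s
	},
	[]string{"table", "chain"},
)

func init() {
	prometheus.MustRegister(iptablesUpdateSeconds)
}

// observeRuleUpdate wraps an iptables operation and records its duration.
func observeRuleUpdate(table, chain string, fn func() error) error {
	start := time.Now()
	err := fn()
	iptablesUpdateSeconds.WithLabelValues(table, chain).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	// Usage example with a stand-in for a real iptables call.
	_ = observeRuleUpdate("nat", "POSTROUTING", func() error {
		time.Sleep(25 * time.Millisecond) // placeholder for the actual iptables operation
		return nil
	})
}
```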

When this issue happened, the logs from VPC CNI had 0 warning logs and 0 error logs (everything was info or debug). It wasn't until a few days later, while desperately checking other logs from the log collection tool, that I realized kube-proxy had been complaining about lock contention around the same time vpc-cni was running iptables operations.

Do you think it's reasonable for vpc-cni to be more verbose / transparent on blocking operations like this?

duxing commented 1 month ago

hi @orsenthil !

the contention, if observed, happens only during aws-node (VPC CNI) pod startup and not while pods are running.

That's absolutely right. I noticed this as well: when this issue happened to new nodes, existing nodes were perfectly fine, even when they needed to assign new EIPs. I'll try to see if the EKS optimized kube-proxy addon allows me to specify those values via add-on configuration. After doing more research, I think testing IPVS would be a better fix for the issue I ran into. I'll turn to these options if I have to stay with iptables mode.

thanks again for helping! @orsenthil @achevuru

achevuru commented 1 month ago

@duxing VPC CNI logs should show a relevant error message if it runs into an iptables contention issue. I believe in your case VPC CNI held the lock and kube-proxy was running into it, which is why you saw those error messages in the kube-proxy logs. But we can enhance the IPAMD logs where it waits for the API server connectivity test to succeed. We will track that.

duxing commented 1 month ago

Thanks for confirming!

If another entity acquires the lock first, vpc-cni would emit a warning log, right?

What about adding a debug log for the iptables update duration? This value can be calculated from multiple consecutive log entries on the same instance, but that isn't easy to query.

With a single log entry, this duration can be queried easily and graphed to capture issues and history, e.g.:

{"level":"debug","ts":"<timestamp>","caller":"networkutils/network.go:<line>","msg":"done execute iptable rule", "duration_ms": "3076", "rule": "xxx"}
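A sketch of code that would emit that kind of single entry, using zap directly (the ipamd log format looks zap-like, but how this would actually be wired into networkutils is omitted; the field names simply mirror the example above):

```go
// Sketch of the proposed single-entry duration log, using zap directly.
// The "rule" and "duration_ms" fields mirror the example entry above.
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	cfg := zap.NewProductionConfig()                 // JSON encoder: level, ts, caller, msg, fields
	cfg.Level = zap.NewAtomicLevelAt(zap.DebugLevel) // make sure debug entries are emitted
	logger, _ := cfg.Build()
	defer logger.Sync()

	// runRule times a single iptables operation and logs one entry with its duration.
	runRule := func(rule string, fn func() error) error {
		start := time.Now()
		err := fn()
		logger.Debug("done execute iptable rule",
			zap.String("rule", rule),
			zap.Int64("duration_ms", time.Since(start).Milliseconds()),
			zap.Error(err),
		)
		return err
	}

	// Usage example with a stand-in for the real iptables call.
	_ = runRule("first SNAT rules for non-VPC outbound traffic", func() error {
		time.Sleep(50 * time.Millisecond) // placeholder for the actual iptables operation
		return nil
	})
}
```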