younsl opened 3 months ago
We will look into this and get back. Btw is this easily reproducible?
@jayanthvn Yes, reproducible. But intermittent, it may take time for it to recur.
Thanks. We will review the logs and get back to you.
Our team found workarounds for two network policy issues through internal testing today. This may not be a fundamental solution, though.
Note: Sensitive information such as IP addresses and application names has been redacted as <REDACTED>.
Symptom 1. Intermittent connection reset by peer is resolved by mounting bpffs on the worker node.
Symptom 2. Delayed readiness time is resolved by adding an ingress netpol that explicitly allows 172.20.0.0/16, which is the Kubernetes service IP range.
As mentioned in the Amazon EKS official documentation, the intermittent connection reset does not occur when bpffs is mounted on the EC2 worker node.
Check the kernel version and AMI version of the worker node.
The workload pod (source) that was experiencing the intermittent connection reset symptom was scheduled on this worker node.
$ kubectl get node -o wide ip-xx-xxx-xx-98.ap-northeast-2.compute.internal
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-xx-xxx-xx-98.ap-northeast-2.compute.internal Ready <none> 69d v1.26.12-eks-5e0fdde xx.xxx.xx.98 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
Connect to the worker node and manually mount bpffs:
# Mount bpf filesystem in worker node
sudo mount -t bpf bpffs /sys/fs/bpf
$ mount -l | grep bpf
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
none on /sys/fs/bpf type bpf (rw,relatime)
none on /sys/fs/bpf type bpf (rw,relatime)
none on /sys/fs/bpf type bpf (rw,relatime)
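For reference, a hedged sketch (not part of the original steps) of persisting this mount across node reboots on Amazon Linux 2; note that the EKS optimized AMI v20230703 or later already handles this, per the prerequisite quoted below:

```bash
# Sketch only: persist the bpffs mount across reboots, e.g. from EC2 user data
# or a bootstrap script on an Amazon Linux 2 worker node.
grep -q ' /sys/fs/bpf ' /etc/fstab || \
  echo 'bpffs /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0' | sudo tee -a /etc/fstab
sudo mount -a   # mounts anything listed in fstab that is not mounted yet
```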
Question: Even though the prerequisites are met, why do I have to mount the BPF filesystem (bpffs) on the worker node to resolve the symptom?
Prerequisite for network policy of VPC CNI: "For all other cluster versions, if you upgrade the Amazon EKS optimized Amazon Linux to version v20230703 or later or you upgrade the Bottlerocket AMI to version v1.0.2 or later, you can skip this step."
Reference: EKS User guide
Since mounting bpffs before work on April 11th, the read ECONNRESET error has not occurred.
Add an ingress netpol to the workload pod (source). This ingress netpol explicitly allows ingress from 172.20.0.0/16, which is the Kubernetes service IP range.
The network policy enforcing mode is set to standard, the default.
$ kubectl get ds -n kube-system aws-node -o yaml
containers:
- env:
  - name: NETWORK_POLICY_ENFORCING_MODE
    value: standard
  name: aws-node
Create a new ingress netpol that ‘explicitly’ allows ingress from the Kubernetes Service IP range.
$ kubectl get netpol -n <REDACTED> ingress-service -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  ...
spec:
  ingress:
  - from:
    - ipBlock:
        cidr: 172.20.0.0/16
    ports:
    - endPort: 65535
      port: 1
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/networkpolicy-ingress-service: apply
  policyTypes:
  - Ingress
status: {}
$ kubectl get pod -n <REDACTED> t<REDACTED> -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/restartedAt: "2024-04-12T12:30:49+09:00"
  creationTimestamp: "2024-04-12T06:14:02Z"
  generateName: t<REDACTED>-cc878cb69-
  labels:
    ...
    app.kubernetes.io/networkpolicy-ingress-service: apply
    app.kubernetes.io/networkpolicy-ingress-t<REDACTED>: apply
    pod-template-hash: cc878cb69
To explicitly allow the Kubernetes service IP range, the app.kubernetes.io/networkpolicy-ingress-service label was added to the Pod.
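For illustration only (the deployment and namespace names are placeholders, not from this issue), the label can be added to the pod template with a patch like:

```bash
# Hypothetical example: add the selector label to the Deployment's pod template so
# that new pods are matched by the ingress-service NetworkPolicy above.
kubectl patch deployment <workload> -n <namespace> --type merge -p \
  '{"spec":{"template":{"metadata":{"labels":{"app.kubernetes.io/networkpolicy-ingress-service":"apply"}}}}}'
```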
The delayed readiness time issue was resolved after explicitly attaching the ingress netpol to the workload pod (source), as shown in this comment.
$ tail -f /var/log/aws-routed-eni/network-policy-agent.log | egrep 'xx.xxx.xx.159'
{"level":"info","ts":"2024-04-12T06:09:36.079Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"xx.xxx.xx.159","Src Port":43798,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:09:36.080Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"xx.xxx.xx.159","Src Port":53898,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:09:36.081Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"xx.xxx.xx.159","Src Port":55208,"Dest IP":"172.20.67.165","Dest Port":80,"Proto":"TCP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:09:41.037Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"xx.xxx.xx.159","Src Port":56070,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:09:41.037Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"xx.xxx.xx.159","Src Port":46937,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"ACCEPT"}
... No deny logs from the Kubernetes service IP 172.20.67.165 ...
{"level":"info","ts":"2024-04-12T06:10:03.016Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:680","msg":"Updating Map with ","IP Key:":"xx.xxx.xx.159/32"}
{"level":"info","ts":"2024-04-12T06:10:03.294Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:680","msg":"Updating Map with ","IP Key:":"xx.xxx.xx.159/32"}
{"level":"info","ts":"2024-04-12T06:10:03.480Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:680","msg":"Updating Map with ","IP Key:":"xx.xxx.xx.159/32"}
{"level":"info","ts":"2024-04-12T06:10:03.528Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:680","msg":"Updating Map with ","IP Key:":"xx.xxx.xx.159/32"}
{"level":"info","ts":"2024-04-12T06:10:03.678Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:680","msg":"Updating Map with ","IP Key:":"xx.xxx.xx.159/32"}
{"level":"info","ts":"2024-04-12T06:10:03.716Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:680","msg":"Updating Map with ","IP Key:":"xx.xxx.xx.159/32"}
{"level":"info","ts":"2024-04-12T06:10:11.661Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"ss.sss.34.98","Src Port":38850,"Dest IP":"xx.xxx.xx.159","Dest Port":3000,"Proto":"TCP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:10:13.554Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"ss.sss.10.163","Src Port":19310,"Dest IP":"xx.xxx.xx.159","Dest Port":3000,"Proto":"TCP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:10:13.606Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"ss.sss.11.99","Src Port":22244,"Dest IP":"xx.xxx.xx.159","Dest Port":3000,"Proto":"TCP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-12T06:10:14.004Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"ss.sss.29.161","Src Port":52804,"Dest IP":"xx.xxx.xx.159","Dest Port":3000,"Proto":"TCP","Verdict":"ACCEPT"}
Captured conntrack list between the workload pod (source, ends with 35.242) and the cluster IP of the destination pod (172.20.67.165):
# Run on the worker node where the source pod is scheduled
$ conntrack -L --src ss.sss.35.242 --dst 172.20.67.165
tcp 6 118 TIME_WAIT src=ss.sss.35.242 dst=172.20.67.165 sport=55938 dport=80 src=ss.sss.19.208 dst=ss.sss.35.242 sport=8080 dport=55938 [ASSURED] mark=0 use=1
tcp 6 431998 ESTABLISHED src=ss.sss.35.242 dst=172.20.67.165 sport=55944 dport=80 src=ss.sss.21.1 dst=ss.sss.35.242 sport=8080 dport=55944 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 2 flow entries have been shown.
With the conntrack -L command, I observed that connections became ESTABLISHED rapidly between the workload pod (source) and the destination pod (service IP).
# Run on the worker node where the source pod is scheduled
$ conntrack -L --src ss.sss.35.242 --dst 172.20.67.165
tcp 6 98 TIME_WAIT src=ss.sss.35.242 dst=172.20.67.165 sport=55938 dport=80 src=ss.sss.19.208 dst=ss.sss.35.242 sport=8080 dport=55938 [ASSURED] mark=0 use=1
tcp 6 102 TIME_WAIT src=ss.sss.35.242 dst=172.20.67.165 sport=55944 dport=80 src=ss.sss.21.1 dst=ss.sss.35.242 sport=8080 dport=55944 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 2 flow entries have been shown.
It was observed that the ready time of the pod was dramatically reduced from 92 seconds to 32 seconds.
$ kubectl get pod -n <REDACTED> -l app.kubernetes.io/name=t<REDACTED>
NAME READY STATUS RESTARTS AGE
t<REDACTED>-cc878cb69-8tmt9 1/1 Running 0 33s
At this point, the readiness time for the workload (source) pod is back to normal.
The readiness time comparison:
| Source Pod | Netpol | Time to readiness | Status |
|---|---|---|---|
| Workload Pod | egress netpol only, without ingress netpol | 92s | Delayed |
| Workload Pod | egress netpol with newly attached ingress netpol | 32s | Normal |
@jayanthvn @achevuru I submitted node-level support bundles to k8s-awscni-triage@amazon.com.
@younsl Thanks for sharing your findings with us.
For the delayed readiness time issue: it is expected, as we discussed on the call, if the pod attempts to start a connection before the eBPF probes are configured against the pod interface. Response packets can potentially be dropped if the probes are set up before the response packet reaches the source pod. Refer to this comment for a detailed explanation.
Our recommended solution for this is Strict mode, which will gate pod launch until policies are configured against the newly launched pod. If you don't want to migrate to Strict mode due to other limitations, then you can consider the workarounds included in the above comment. With your workaround, you're explicitly allowing the Service CIDR, which explicitly allows the return packet coming from any cluster-local service, thereby bypassing the conntrack requirement. If you're OK with this ingress rule, then this is a viable workaround for in-cluster traffic, assuming your pods only talk to other pods in the cluster via Service VIPs.
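A hedged sketch of flipping the mode (the lowercase value strict is assumed to mirror the standard default shown earlier; if aws-node is managed as an EKS add-on, direct DaemonSet edits may be reconciled away, so the add-on configuration is the durable route):

```bash
# Sketch only: set the enforcing mode on the aws-node DaemonSet and wait for rollout.
kubectl set env daemonset/aws-node -n kube-system NETWORK_POLICY_ENFORCING_MODE=strict
kubectl rollout status daemonset/aws-node -n kube-system
```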
Intermittent connection reset by peer: As you can see, there were already multiple BPF FS mounts in your command output (mount -l | grep bpf) and there is no need to mount the BPF FS again. If the BPF FS isn't mounted then nothing will work and the network policy agent will crash. So this shouldn't fix anything, and considering your issue is intermittent it is probably pure chance that you haven't run into it in the last day. We believe you're running into the same issue that is called out in #246. Basically, if a new connection is in the process of being established right at the time the conntrack cleanup routine runs, the NP agent can incorrectly expire the entry from its own conntrack table, leading to intermittent connection resets; the subsequent retry addresses this. We will address this ASAP.
@achevuru
> So, this shouldn't fix anything and considering your issue is intermittent it is probably pure chance that you haven't run into it in the last 1 day.
[updated] Yes, you're right.
Even though the bpffs file system was mounted on all worker nodes, the intermittent connection reset by peer symptom recurred 3 days after the workaround was applied.
Is anything in progress for the next release? I found updated progress in https://github.com/aws/aws-network-policy-agent/issues/204#issuecomment-2085894403 and https://github.com/aws/aws-network-policy-agent/pull/256.
The fix is released with network policy agent v1.1.2 - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2. Please test and let us know if there are any issues.
Hi, @jayanthvn. I'm still experiencing intermittent connection resets in network-policy-agent:v1.1.2-eksbuild.1.
Tested with v1.18.2-eksbuild.1 (network-policy-agent v1.1.2-eksbuild.1). I made the two changes below to resolve the read ECONNRESET error that occurs intermittently in some pods; observed after the changes from Jun 7, 17:46 (KST).
1. Upgraded the VPC CNI from v1.18.1 to v1.18.2-eksbuild.1, which also upgrades the network-policy-agent container version from v1.1.1 to v1.1.2.
2. Increased the conntrack-cache-cleanup-period argument from the default value of 300 (5m) to 21600 (6h).
VPC CNI version info:
$ kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2-3
amazon-k8s-cni-init:v1.18.2-eksbuild.1
amazon-k8s-cni:v1.18.2-eksbuild.1
amazon/aws-network-policy-agent:v1.1.2-eksbuild.1
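The managed add-on version can also be checked via the AWS CLI; a small sketch (the cluster name is a placeholder):

```bash
# Show the installed vpc-cni add-on version for the cluster.
aws eks describe-addon --cluster-name <cluster-name> --addon-name vpc-cni \
  --query 'addon.addonVersion' --output text
```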
The NETWORK_POLICY_ENFORCING_MODE setting currently defaults to standard, not strict.
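For what it's worth, one illustrative way to confirm the active mode from the DaemonSet spec:

```bash
# Print the NETWORK_POLICY_ENFORCING_MODE value from the aws-node container spec.
kubectl get ds aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="aws-node")].env[?(@.name=="NETWORK_POLICY_ENFORCING_MODE")].value}'
```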
Diff for conntrack-cache-cleanup-period argument:
- args:
- - --conntrack-cache-cleanup-period=300 # 5m (default)
+ - --conntrack-cache-cleanup-period=21600 # 6h
However, even after upgrading to VPC CNI v1.18.2-eksbuild.1, which includes network-policy-agent v1.1.2, some application pods are still experiencing intermittent packet drops, such as read ECONNRESET (err: server not responding (read ECONNRESET)) errors.
After checking internally with the developer, we found that the application container affected by the network issues does not include retry logic for connection failures.
Timeline of read ECONNRESET errors occurring in some pods (starting after upgrading to network-policy-agent v1.1.2):
> [!NOTE]
> - Times in parentheses ( ) indicate the time difference from the last occurrence.
> - All times listed in the timeline below are in KST (UTC+09:00).
6/10 08:00 (6h)
6/10 02:00 (7h 39m)
6/9 18:21 (2m)
6/9 18:19 (6m)
6/9 18:13 (1h 36m)
6/9 16:37 (1h 25m)
6/9 15:12 (7h 57m)
6/9 07:15 (2h 38m)
6/9 04:37 (12h 26m)
6/8 16:11 (4h 23m)
6/8 11:48 (5h 34m)
6/8 06:14
@younsl - as shared internally on the service ticket, the timeouts are not in line with the conntrack cleanup, since you have the cleanup every 6 hours while the timeouts are happening at varied times. At these times, are you noticing a spike in the network policy agent conntrack cache? One suspect is that the cache is getting full, leading to certain entries getting evicted and causing timeouts.
@jayanthvn I'm waiting to test PR #280 on my affected clusters.
I enabled the network-policy-agent policy event logs on my dev EKS v1.28 cluster.
VPC CNI yaml:
- args:
- --enable-ipv6=false
- --enable-network-policy=true
- --enable-cloudwatch-logs=false
- - --enable-policy-event-logs=false
+ - --enable-policy-event-logs=true
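With policy event logs enabled, denied flows can be filtered on the worker node; this sketch assumes denied flows are logged with a DENY verdict, mirroring the ACCEPT entries shown earlier:

```bash
# Filter DENY verdicts from the network policy agent log on the worker node.
sudo grep '"Verdict":"DENY"' /var/log/aws-routed-eni/network-policy-agent.log
```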
I will come back and submit the conntrack cleanup logs for network-policy-agent ASAP.
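As an aside, a rough way to spot-check the conntrack cache size asked about above: the sketch below assumes bpftool is installed on the worker node, and the pinned map path and name are assumptions that may differ by agent version.

```bash
# List BPF maps on the node; look for the network policy agent's conntrack map.
sudo bpftool map show
# Rough entry count for a pinned map (the path and map name here are assumptions):
sudo bpftool map dump pinned /sys/fs/bpf/globals/aws/maps/<conntrack-map-name> | grep -c 'key:'
```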
VPC CNI version info (v1.18.2-eksbuild.1):
$ kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2-3
amazon-k8s-cni-init:v1.18.2-eksbuild.1
amazon-k8s-cni:v1.18.2-eksbuild.1
aws-network-policy-agent:v1.1.1-13-gda05900-dirty
What happened:
Background
After migrating the network policy provider from Calico v3.25.1 and Tigera Operator to VPC CNI v1.18.0-eksbuild.1, the following two network policy issues occurred on an EKS v1.26 cluster.
Cluster environment
- {"enableNetworkPolicy":"true"} setting in advanced configuration
- iptables mode
Network policy issues
1. Intermittent connection reset by peer occurs during Pod-to-Pod or Pod-to-EC2 communication.
Similar issues: https://github.com/aws/aws-network-policy-agent/issues/204, https://github.com/aws/aws-network-policy-agent/issues/210, https://github.com/aws/aws-network-policy-agent/issues/236
2. Delayed Running time
Delay in the time it takes for the pod to reach Running.
For pods to which a Network Policy is applied, the time it takes to activate the Readiness Probe is up to 3 times slower (a rough way to measure this is sketched after the related issue links below).
Similar issues: https://github.com/aws/aws-network-policy-agent/issues/189, https://github.com/aws/aws-network-policy-agent/issues/186
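A rough, illustrative way to quantify the delay (not necessarily how it was measured here) is to compare the pod's condition transition timestamps:

```bash
# Print each pod condition with its lastTransitionTime; the gap between
# PodScheduled/Initialized and Ready approximates the readiness delay.
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.lastTransitionTime}{"\n"}{end}'
```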
Attach logs
1. Intermittent connection reset by peer
[tcpdump] From workload pod to EC2 instance
Intermittently, the workload pod receives a reset (RST) packet response from the EC2 instance.
[kubectl sniff] From workload pod to EC2 instance
Intermittently, the workload pod receives a reset (RST) packet response from the EC2 instance.
If the issue occurs in the workload pod, the Slack notification below is output.
2. Delayed Running time
Captured ebpf-sdk log on worker node immediately after pod restart
A Deny log occurs from the destination Service (Cluster IP) 172.20.67.165.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
The same network issue occurred in all VPC CNI v1.16.0, v1.16.1, and v1.18.0 versions.
Environment:
- Kubernetes version (kubectl version): v1.26.12-eks-5e0fdde
- OS (cat /etc/os-release): Amazon Linux release 2 (Karoo)
- Kernel (uname -a): 5.10.205-195.804.amzn2.x86_64