Closed micahnoland closed 5 months ago
@micahnoland - Nice debugging! We were just discussing this scenario yesterday evening while working on https://github.com/aws/aws-network-policy-agent/issues/245
To summarize -
One simple option would be to give an entry in the local conntrack table one additional reconciler pass before it is cleaned up. But if the timer is set to a very low value, we would end up in a similar situation. We will think about this and get back to you.
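To make the "one additional reconciler pass" idea concrete, here is a minimal sketch. It is not the agent's actual code: `FlowKey`, `GraceCleaner`, and the kernel snapshot passed in as a plain map are all hypothetical stand-ins. An entry is deleted only if it was also missing from the kernel conntrack snapshot on the previous pass.

```go
package main

import "fmt"

// FlowKey is a hypothetical stand-in for the agent's conntrack key.
type FlowKey struct {
	SrcIP    string
	SrcPort  uint16
	DstIP    string
	DstPort  uint16
	Protocol uint8
}

// GraceCleaner (hypothetical) deletes a local entry only after it has been
// absent from the kernel conntrack snapshot on two consecutive passes.
type GraceCleaner struct {
	local   map[FlowKey]struct{} // entries in the local conntrack map
	missing map[FlowKey]struct{} // entries that were absent on the previous pass
}

func NewGraceCleaner() *GraceCleaner {
	return &GraceCleaner{
		local:   make(map[FlowKey]struct{}),
		missing: make(map[FlowKey]struct{}),
	}
}

// Add records a flow in the local map.
func (g *GraceCleaner) Add(k FlowKey) { g.local[k] = struct{}{} }

// Has reports whether a flow is still in the local map.
func (g *GraceCleaner) Has(k FlowKey) bool {
	_, ok := g.local[k]
	return ok
}

// Cleanup runs one reconciler pass against a kernel snapshot and returns
// the keys it deleted.
func (g *GraceCleaner) Cleanup(kernel map[FlowKey]struct{}) []FlowKey {
	var deleted []FlowKey
	nextMissing := make(map[FlowKey]struct{})
	for k := range g.local {
		if _, ok := kernel[k]; ok {
			continue // still tracked by the kernel, keep it
		}
		if _, seen := g.missing[k]; seen {
			delete(g.local, k) // absent twice in a row: safe to drop
			deleted = append(deleted, k)
		} else {
			nextMissing[k] = struct{}{} // first miss: give it one more pass
		}
	}
	g.missing = nextMissing
	return deleted
}

func main() {
	flow := FlowKey{SrcIP: "10.3.99.194", SrcPort: 33794, DstIP: "209.54.181.208", DstPort: 443, Protocol: 6}
	g := NewGraceCleaner()
	g.Add(flow)

	// A stale snapshot taken before the flow appeared in the kernel table.
	stale := map[FlowKey]struct{}{}
	fmt.Println("deleted on first pass:", len(g.Cleanup(stale)))
	fmt.Println("flow still present:", g.Has(flow))
}
```

As noted above, this only widens the window: with a very short cleanup interval, two consecutive passes can still both run before the flow appears in the kernel table.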
Right now, we don't check the connection state, which is potentially contributing to this, and this appears to be the RCA for #245 as well. We will check the logs and address it soon.
The fix is released with network policy agent v1.1.2: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2. Please test and let us know if there are any issues.
What happened: We are observing an intermittent issue with network-policy-agent denying return packets on an active flow, seemingly due to the flow being removed from the conntrack map. These connections originate from long-running pods and do not seem related to strict mode.
Perhaps we are misunderstanding the code, but it appears as though there is a race condition here, where:
CleanupConntrackMap() starts
{"level":"info","ts":"2024-04-10T20:38:07.026Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"10.3.99.194","Src Port":33794,"Dest IP":"209.54.181.208","Dest Port":443,"Proto":"TCP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-04-10T20:38:07.126Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Delete - ":"Conntrack Key : Source IP - 10.3.99.194 Source port - 33794 Dest IP - 209.54.181.208 Dest port - 443 Protocol - 6 Owner IP - 10.3.99.194"}
{"level":"info","ts":"2024-04-10T20:38:34.848Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"209.54.181.208","Src Port":443,"Dest IP":"10.3.99.194","Dest Port":33794,"Proto":"TCP","Verdict":"DENY"}
{"level":"info","ts":"2024-04-10T20:39:07.032Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"10.3.99.194","Src Port":33794,"Dest IP":"209.54.181.208","Dest Port":443,"Proto":"TCP","Verdict":"ACCEPT"}
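For anyone who wants to check their own agent logs for the same pattern, here is a rough sketch that flags DENY verdicts on the reverse direction of a flow whose key was previously removed by "Conntrack cleanup". The field names are taken from the log lines above. The function name, the regular expression, and the reverse-direction-only matching are assumptions for illustration, not part of the agent.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// logLine maps only the fields used here; names come from the logs above.
type logLine struct {
	TS      string `json:"ts"`
	Msg     string `json:"msg"`
	SrcIP   string `json:"Src IP"`
	SrcPort int    `json:"Src Port"`
	DstIP   string `json:"Dest IP"`
	DstPort int    `json:"Dest Port"`
	Verdict string `json:"Verdict"`
	Delete  string `json:"Delete - "`
}

// keyRe pulls the 4-tuple out of the "Delete - " text of a cleanup line.
var keyRe = regexp.MustCompile(`Source IP - (\S+) Source port - (\d+) Dest IP - (\S+) Dest port - (\d+)`)

// FindDeniedAfterCleanup (hypothetical helper) returns the timestamps of DENY
// verdicts whose flow is the reverse of a key deleted earlier in the log.
func FindDeniedAfterCleanup(logs string) []string {
	deleted := map[string]bool{}
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var l logLine
		if err := json.Unmarshal(sc.Bytes(), &l); err != nil {
			continue // skip lines that are not JSON
		}
		switch l.Msg {
		case "Conntrack cleanup":
			if m := keyRe.FindStringSubmatch(l.Delete); m != nil {
				deleted[fmt.Sprintf("%s:%s>%s:%s", m[1], m[2], m[3], m[4])] = true
			}
		case "Flow Info: ":
			// A return packet has source and destination swapped
			// relative to the deleted key.
			reverse := fmt.Sprintf("%s:%d>%s:%d", l.DstIP, l.DstPort, l.SrcIP, l.SrcPort)
			if l.Verdict == "DENY" && deleted[reverse] {
				hits = append(hits, l.TS)
			}
		}
	}
	return hits
}

// sampleLogs holds the cleanup and DENY lines from this report.
const sampleLogs = `{"level":"info","ts":"2024-04-10T20:38:07.126Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Delete - ":"Conntrack Key : Source IP - 10.3.99.194 Source port - 33794 Dest IP - 209.54.181.208 Dest port - 443 Protocol - 6 Owner IP - 10.3.99.194"}
{"level":"info","ts":"2024-04-10T20:38:34.848Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"209.54.181.208","Src Port":443,"Dest IP":"10.3.99.194","Dest Port":33794,"Proto":"TCP","Verdict":"DENY"}`

func main() {
	fmt.Println(FindDeniedAfterCleanup(sampleLogs))
}
```

Run against the lines above, this reports the DENY at 2024-04-10T20:38:34.848Z, roughly 27 seconds after the key was deleted.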
Running conntrack -E -o timestamp during this time, we only see 3 lines for port 33794:
I have the output of sudo bash /opt/cni/bin/aws-cni-support.sh as well and I'll send that along to k8s-awscni-triage@amazon.com.
How to reproduce it (as minimally and precisely as possible): The issue is intermittent, since the outbound connection needs to be established during the short time that cleanup is running. We have observed this more easily in pods which establish many outbound connections.
Environment:
Kubernetes version (kubectl version): Server Version: v1.28.6-eks-508b6b3
CNI version: v1.18.0-eksbuild.1
Network Policy Agent version: v1.1.0-eksbuild.1
OS (cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Kernel (uname -a): 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux