aws / aws-network-policy-agent

Apache License 2.0

Unexpected probe failures due to transient denied connections #305

Open steveteahan opened 2 months ago

steveteahan commented 2 months ago

What happened:

We have an application that is failing readiness and liveness probes because the traffic is being denied by NetPol agent. We've seen this across multiple versions including v1.1.0-eksbuild.1 and v1.1.2-eksbuild.1.

I was able to see that the network traffic was being denied in /var/log/aws-routed-eni/network-policy-agent.log. After some period of time, the traffic is accepted again and the application recovers.

What stuck out to me is that there are multiple PolicyEndpoints created. Our NP looks something like:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/name: <app>
  name: <app>
spec:
  ingress:
  - from:
    - namespaceSelector: {}
...

Think of use cases where all pods need to reach a core service. This results in multiple PEs:

% kubectl -n <namespace> get policyendpoint <pe-name-0> -o json | jq '.spec.ingress | length'
879

% kubectl -n <namespace> get policyendpoint <pe-name-1> -o json | jq '.spec.ingress | length'
684

% kubectl -n <namespace> get policyendpoint <pe-name-2> -o json | jq '.spec.ingress | length'
293
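As an aside, the per-PE counts above can be totaled with a single jq filter over all PolicyEndpoints in the namespace. The stub JSON below is hypothetical (real PE ingress entries carry CIDRs and ports, not empty objects); it only demonstrates the filter:

```shell
# Hypothetical stub of `kubectl -n <namespace> get policyendpoints -o json`
# output; the jq expression totals .spec.ingress entries across all PEs.
echo '{"items":[{"spec":{"ingress":[{},{}]}},{"spec":{"ingress":[{}]}}]}' \
  | jq '[.items[].spec.ingress | length] | add'
# prints 3
```

Against a live cluster, the same filter can be piped directly from kubectl instead of the stub.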

I tested that changing namespaceSelector: {} to a rule like ipBlock.cidr: 0.0.0.0/0 removes the multiple PEs and creates a single PE, since every Pod in the cluster no longer needs to be enumerated in .spec.ingress. We haven't seen a single probe failure in the week since changing the configuration to remove the multiple PEs, compared to literally hundreds of failures over the prior couple of weeks.
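For reference, the changed rule looks roughly like this (a sketch; the name is a placeholder, and the rest of the spec is unchanged from the policy above):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: <app>
spec:
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0   # single PE; pod IPs are not enumerated
...
```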

It's also worth noting that this is an intermittent issue. The pattern we see is that the probes fail, the container is restarted, and then the service recovers. We see this anywhere from 1-5 times a day. Interestingly, we see this issue on a few of our clusters with ~2000 pods but a relatively low pod churn rate. We never see container restarts on our cluster with ~3000 pods, which has a higher churn rate due to heavy usage of CronJobs. The Received a new reconcile request log line appears far more frequently in /var/log/aws-routed-eni/network-policy-agent.log on the cluster that's not experiencing this issue. The bug may still be occurring on that cluster, but the next reconciliation happens faster than the time it takes for the liveness probes to fail (~30s).
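A rough way to compare reconcile cadence across nodes is to count that log line per node. The excerpt below is a fabricated stand-in for the real agent log, just to show the command:

```shell
# Hypothetical excerpt standing in for
# /var/log/aws-routed-eni/network-policy-agent.log
cat > np-agent-sample.log <<'EOF'
{"level":"info","msg":"Received a new reconcile request"}
{"level":"info","msg":"some other event"}
{"level":"info","msg":"Received a new reconcile request"}
EOF

# Count reconcile events; on a node, point this at the real log path.
grep -c "Received a new reconcile request" np-agent-sample.log
# prints 2
```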

Attach logs

Logs were sent.

What you expected to happen:

Liveness / readiness probe traffic is not denied.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster that has >1000-2000 pods to simulate multiple PE entries
  2. Configure an application with liveness probes, something like http-get http://some-endpoint delay=0s timeout=3s period=5s #success=1 #failure=6
  3. Configure a NetworkPolicy using namespaceSelector: {} for ingress rules
  4. Allow the application to run for some number of hours (again, we see this 0-3 times per day)
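The probe settings in step 2 correspond roughly to this pod spec fragment (a sketch; the path and port are placeholders):

```yaml
livenessProbe:
  httpGet:
    path: /some-endpoint
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 3
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 6
```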

Anything else we need to know?:

Environment:

% kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.11-eks-db838b0
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
$ uname -a
Linux eks-prod-dove-c-0fa30f4f6ba5af8be 5.10.219-208.866.amzn2.x86_64 #1 SMP Tue Jun 18 14:00:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
jayanthvn commented 2 months ago

@steveteahan - On one of the nodes with an impacted pod, can you please check the SDK logs and see if you notice this line around the time of the event -

SDK logs location - /var/log/aws-routed-eni/ebpf-sdk.log

"error":"unable to update map: invalid argument"
steveteahan commented 1 month ago

@jayanthvn I'll have to find some time to reproduce the issue again. It may not be for a few days. I didn't get a chance to capture those logs originally, but I'll make sure to run the capture script on the next one.

steveteahan commented 1 month ago

@jayanthvn I haven't had as much time to reproduce in our development environment as I had hoped. Is this issue something that you also had a chance to reproduce at all? I'm concerned that this bug could prevent the usage of NetworkPolicy on foundational services that have many pods connecting to them.

jaydeokar commented 1 month ago

@steveteahan
There is one bug, which we fixed in the latest v1.1.3 release, where the IPs can get garbage collected when the SDK tries to make an update, resulting in traffic getting blocked. Have you tried the latest version to see if you run into the same issue?

steveteahan commented 1 month ago

> There is one bug, which we fixed in the latest v1.1.3 release, where the IPs can get garbage collected when the SDK tries to make an update, resulting in traffic getting blocked. Have you tried the latest version to see if you run into the same issue?

I have not tried the latest version. Would that bug only present itself in scenarios where there is >1 PolicyEndpoint? We still have not experienced this issue since I changed the NetworkPolicy rule such that there is only a single PolicyEndpoint.