aws / aws-network-policy-agent

Network denies despite allow-all policy (strict mode) #288

Open creinheimer opened 1 month ago

creinheimer commented 1 month ago

Hello,

For several weeks we've been working on implementing network policies using the AWS solution. However, we've encountered various challenges along the way. Initially, we discovered that using the standard enforcement mode could lead to network instability. As a result, we decided to use the so-called strict mode.

In this thread https://github.com/aws/aws-network-policy-agent/issues/271#issuecomment-2183294414, @achevuru suggested that we could create an allow-all policy for each namespace, and that the only side effect would be the deny behaviour during the first seconds of a newly launched pod. We then created an allow-all policy in all namespaces and enabled the ANNOTATE_POD_IP flag to allow faster network policy evaluation.

Now we have a new issue: pods in namespaces with an allow-all network policy are still experiencing network denies. This isn't limited to the initial startup period. It's happening long after pods have been running, sometimes hours later.

This behaviour is causing various problems, including pod crashes. In some cases even the pod's internal health checks are denied, triggering unnecessary restarts.

Can you provide any insight into why this might be happening? Am I missing something?

More info:

Deny logs from pods to control-plane ![image](https://github.com/user-attachments/assets/037ac8ec-5256-4d26-8e24-809b4f88d7da)
Deny logs from pods to pods on same namespace ![image](https://github.com/user-attachments/assets/f950adbe-11b4-46a2-9127-daccafddcc93)

These are just a few of them. We had approx. 200 denies over the last 15 minutes.

NetworkPolicy allow-all:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  finalizers:
    - networking.k8s.aws/resources
  name: allow-all
  namespace: kube-prometheus-stack
spec:
  egress:
    - {}
  ingress:
    - {}
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

Environment:

Our AWS-CNI uses the default helm-chart with the following variables:

AWS-CNI configuration:

```yaml
env:
  ENABLE_PREFIX_DELEGATION: "true"
  AWS_VPC_K8S_PLUGIN_LOG_FILE: stderr
  AWS_VPC_K8S_PLUGIN_LOG_LEVEL: DEBUG
  AWS_VPC_K8S_CNI_LOG_FILE: stdout
  AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
  NETWORK_POLICY_ENFORCING_MODE: strict
  ANNOTATE_POD_IP: "true"
```

Note:

@jayanthvn this is a follow-up to https://github.com/aws/aws-network-policy-agent/issues/73#issuecomment-2214634920.

achevuru commented 1 month ago

@creinheimer If I understood the issue accurately, you have pods that are configured only with an allow-all policy, but their traffic is still being denied? If so, is this specific to a few pods, or do you observe this behavior across all pods in your cluster?

So, the issue with standard mode that you referenced above is tied to flows that start during the first few seconds after a new pod launches, I assume? ANNOTATE_POD_IP should help bring the network policy reconciliation latency down to under 1s in standard mode.
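
For reference, with ANNOTATE_POD_IP enabled the CNI is expected to write the pod's IP onto the pod object as an annotation, which is what lets the policy reconciliation avoid waiting on the API server for the IP. A minimal sketch of what such an annotated pod might look like; the pod name, namespace, and IP below are placeholders, and the annotation key should be verified against the CNI version in use:

```yaml
# Illustrative only: pod annotated by the CNI when ANNOTATE_POD_IP is enabled.
# Name, namespace, and IP are placeholders; confirm the annotation key against
# your CNI version.
apiVersion: v1
kind: Pod
metadata:
  name: workload-example
  namespace: kube-prometheus-stack
  annotations:
    vpc.amazonaws.com/pod-ips: 10.0.1.23
```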

creinheimer commented 1 month ago

Hi @achevuru,

I mentioned the other issues to provide some context. ANNOTATE_POD_IP is already configured.

> If I understood the issue accurately, you have pods that are configured only with an allow-all policy, but their traffic is still being denied? If so, is this specific to a few pods, or do you observe this behavior across all pods in your cluster?

Yes. That happens sporadically on different pods, even though we have an allow-all rule in all namespaces.

I would suggest we focus on understanding why denials occur sporadically (sometimes hours after pods have been running) despite having an allow-all rule applied to all namespaces.

pelzerim commented 4 weeks ago

Hi, we are experiencing a similar issue with STRICT mode + ANNOTATE_POD_IP. We also have an allow-all policy.

Pods can start but are unable to connect to any host. They end up in a crash loop (due to timeouts in the app) and never recover. Only deleting the pods manually resolves the issue.

We moved to strict mode because we were experiencing dropped connections for workloads shortly after pod start.

These are the network-policy-agent logs: network-policy-agent.log. The pod name is workload-dxl6g. I've also attached the aws-eks-na-cli outputs.

We can easily reproduce this.

[edit] Some more information: we have extreme pod churn (pod lifetimes of 5-10 seconds) and this issue affects roughly 25% of pods. We had to move away from strict mode and are now using ANNOTATE_POD_IP plus an init container that literally watches for "Successfully attached.*$${POD_NAME}" in the agent's logs (see the sketch below).
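
For anyone wanting to replicate that, here is a minimal sketch of the init-container approach, assuming the node agent's log is readable at /var/log/aws-routed-eni/network-policy-agent.log on the host; the container name, image, and mount path are assumptions, not taken from our actual manifests:

```yaml
# Sketch only: block pod startup until the node agent logs a successful attach
# for this pod. Log path, image, and names are assumptions and may need
# adjusting for your AMI / agent version.
spec:
  initContainers:
    - name: wait-for-policy-attach
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          until grep -q "Successfully attached.*${POD_NAME}" /host/network-policy-agent.log; do
            sleep 1
          done
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumeMounts:
        - name: npa-log
          mountPath: /host/network-policy-agent.log
          readOnly: true
  volumes:
    - name: npa-log
      hostPath:
        path: /var/log/aws-routed-eni/network-policy-agent.log
        type: File
```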

I am happy to supply any debugging information to help resolve this.

anshulpatel25 commented 3 weeks ago

Hello @pelzerim,

We are seeing the same behaviour, as our use case also involves short pod lifetimes of 10-15 seconds.

The init container workaround that you currently have in place, is it 100% effective, or are you still observing issues with it?

Thanks!

pelzerim commented 3 weeks ago

> The init container workaround that you currently have in place, is it 100% effective, or are you still observing issues with it?

Hey @anshulpatel25, the init container workaround only works for standard mode. We've determined that it's not actually the log line that does the magic but the minimum wait time of 1 second (see the sketch below).
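
If it really is just the delay, the workaround reduces to something like this sketch (the container name and image are placeholders, and the 1-second value is simply the minimum we found to work):

```yaml
# Sketch only: if the effective fix is the ~1s delay rather than the log match,
# the init container collapses to a plain sleep. Names and image are placeholders.
spec:
  initContainers:
    - name: wait-for-network-policy
      image: busybox:1.36
      command: ["sh", "-c", "sleep 1"]
```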

Unrelated to that, strict mode currently seems to be incompatible with high pod churn (see my previous comment).