aws / aws-network-policy-agent

Apache License 2.0

unexpected DENY entries #163

Closed ed4wg closed 8 months ago

ed4wg commented 10 months ago

What happened:

In reviewing the logs, I see some unexpected DENY entries in the wrong direction.

In this example, pod A connects to pod B on port 5003. Pod A only talks to pod B, not the other way around. I believe this particular traffic comes from a liveness probe that runs every 5 minutes: the probe on pod A runs a command that makes a simple GET request to pod B via code. In the example below, you can see the first connection works, but then 5 minutes later there are a bunch of DENY statements (from pod B -> pod A) followed by an ACCEPT in the correct direction. This pattern repeats in the logs at the same interval.

I'm also seeing other log entries with the same issue: a DENY where the target pod shows up as the SRC instead of the DEST.

I also don't see any errors in the application logs themselves, i.e. there are no actual problems visible from the apps running on pod A and pod B.

Example

2023-12-06T14:25:58.455-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:25:58.466-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:30:28.869-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:28.879-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:28.892-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:29.037-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:29.458-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:30.303-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:30.808-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:31.953-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:35.294-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:41.937-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:55.287-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:58.838-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:30:58.850-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
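
The reversed-direction pattern above can be spotted programmatically when sifting through large volumes of these logs. Below is a minimal sketch (not part of the agent; the `SIP`/`SPORT`/`DIP`/`DPORT`/`PolicyVerdict` field names are taken from the log lines above) that flags DENY entries whose 5-tuple is the reverse of a previously ACCEPTed flow:

```python
# Sketch: flag DENY entries that are the reverse leg of a previously
# ACCEPTed flow, using the semicolon-delimited format shown above.

def parse(line):
    """Parse one 'key: value;key: value;...' flow-log line into a dict."""
    return dict(part.split(": ", 1) for part in line.split(";"))

def reversed_denies(lines):
    accepted = set()  # (sip, sport, dip, dport) tuples seen with ACCEPT
    flagged = []
    for line in lines:
        f = parse(line)
        tup = (f["SIP"], f["SPORT"], f["DIP"], f["DPORT"])
        if f["PolicyVerdict"] == "ACCEPT":
            accepted.add(tup)
        elif f["PolicyVerdict"] == "DENY":
            # Reverse the tuple: a DENY on the return path of an allowed
            # flow is the suspicious pattern described in this issue.
            rev = (tup[2], tup[3], tup[0], tup[1])
            if rev in accepted:
                flagged.append(f)
    return flagged
```

Running this over the excerpt above would flag all of the 14:30 DENY lines, since each is the exact reverse of the 14:25 ACCEPTed connection.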

What you expected to happen: Assuming the correct NetworkPolicy is in place to allow traffic from pod A -> pod B, I do not expect to see any DENY logs showing pod B -> pod A when the connection should be pod A -> pod B.
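
For reference, a policy of the shape described would look roughly like this (the names and labels here are hypothetical; only port 5003 comes from the logs above):

```yaml
# Hypothetical NetworkPolicy allowing pod A -> pod B on 5003.
# Labels, names, and namespace are illustrative, not from the issue.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-a-to-b
spec:
  podSelector:
    matchLabels:
      app: pod-b          # target of the connection (pod B)
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: pod-a  # source of the connection (pod A)
      ports:
        - protocol: TCP
          port: 5003
```

With a policy like this, return traffic for an established pod A -> pod B connection should be allowed implicitly, which is why DENYs on the reverse 5-tuple are unexpected.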

How to reproduce it (as minimally and precisely as possible): I have not been able to trigger this directly. I just see it in the logs.

Anything else we need to know?:

Environment:

jayanthvn commented 10 months ago

Thanks @ed4wg. Will you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.

ed4wg commented 10 months ago

I started using it about 45 minutes ago and I'm not seeing any of those unexpected DENY logs. :)

I'll report back tomorrow after it has had a chance to run for longer.

ed4wg commented 10 months ago

Everything's looking good. Not a single errant DENY has shown up in the logs.

Do you have an ETA for the release?

achevuru commented 10 months ago

@ed4wg Thanks for validating the rc image. Official release should be available within the next 2 weeks.

ed4wg commented 10 months ago

In further testing, I noticed this problem is back. Here are the differences:

I saw that 1.0.7-rc4 was available and that also has the same issue.

jayanthvn commented 10 months ago

So is the behavior the same as before, i.e., the first connection works, but then 5 minutes later there are a bunch of DENY statements?

Can you share the node logs? You can run sudo bash /opt/cni/bin/aws-cni-support.sh on the node and mail the output to k8s-awscni-triage@amazon.com. Also, a sample pod spec would help us understand the flow.

ed4wg commented 10 months ago

It seems there are multiple patterns I'm observing now. I put more context in the email I sent.

achevuru commented 10 months ago

@ed4wg Can you check and let us know the kernel's conntrack table size in your nodes?
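
For anyone following along, the conntrack table size and current usage can be read directly on the node like so (assumes the nf_conntrack kernel module is loaded; paths may differ on older kernels):

```shell
# Maximum number of connections the kernel will track
cat /proc/sys/net/netfilter/nf_conntrack_max
# Connections currently being tracked; spurious drops can occur
# if this value approaches the maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
```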

alemuro commented 10 months ago

Hello!

I'm facing the same issue, do you have any news on this topic? We managed to "fix it" by removing the default rule and deploying it again...

Thanks

jayanthvn commented 9 months ago

With regard to @ed4wg's issue after scaling: I found a race condition where we try detaching a probe and the netlink library returns a "link not found" error even though the ENI is present. The node agent assumes the link is deleted and cleans up the pin path while the ENI is still active with a filter attached. Future updates based on the node agent's cache assume the probe is still active, but traffic fails because the filter points to a program that has already been deleted, causing inconsistent behavior. Will provide a fix soon.

jayanthvn commented 9 months ago

We have v1.0.8-rc1 tag available if you would like to try.

ed4wg commented 9 months ago

I ran a test on my dev cluster using v1.0.8-rc2 over the weekend and I'm still seeing unexpected DENY logs. Before this update I was seeing far more of them; it's quite a bit less now, but still there. Looking at some of the DENY logs, they look the same as before: problems resolving DNS, problems egressing to external services like S3, and problems with one pod connecting to a service on another pod.

jayanthvn commented 9 months ago

Thanks for trying it out. We were able to reproduce the issue and have a possible root cause. We're still in the process of completing the tests and code review, and will cut a release candidate once that's done.

ed4wg commented 8 months ago

Hi @jayanthvn - any update on a fix for this issue?

jayanthvn commented 8 months ago

The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3. Please try it out and let us know if you see any issues.

ed4wg commented 8 months ago

I'm still seeing some unexpected DENY entries. But I'm also just pointing to the v1.0.8 aws-eks-nodeagent; the overall v1.16.3 isn't available yet via the add-on, so I'm still on v1.16.2. Not sure if that makes a difference here?

jayanthvn commented 8 months ago

@ed4wg - Are you available on K8s Slack? If so, can you share your handle? We can get on a call and look into this further.

ed4wg commented 8 months ago

Hi @jayanthvn - I sent an email to k8s-awscni-triage@amazon.com with the details.

jayanthvn commented 8 months ago

Just for closure here: we synced up offline and the DENY logs are expected, since they occur before the NetworkPolicy is reconciled (i.e., before the probes are attached) or while the pod/service is being terminated. There was no functional impact.