steveteahan opened this issue 2 months ago
@steveteahan - On one of the nodes with an impacted pod, can you please check the SDK logs and see if you notice this line around the time of the event?
SDK logs location - /var/log/aws-routed-eni/ebpf-sdk.log
error: ":"unable to update map: invalid argument"}
@jayanthvn I'll have to find some time to reproduce the issue again. It may not be for a few days. I didn't get a chance to capture those logs originally, but I'll make sure to run the capture script on the next one.
@jayanthvn I haven't had as much time to reproduce this in our development environment as I had hoped. Is this issue something you have had a chance to reproduce on your side? I'm concerned that this bug could prevent the use of NetworkPolicy on foundational services that have many pods connecting to them.
@steveteahan
There is one bug we fixed in the latest v1.1.3 release where the IPs can get garbage collected while the SDK tries to make an update, resulting in traffic getting blocked. Have you tried the latest version to see if you run into the same issue?
I have not tried the latest version. Would that bug only present itself in scenarios where there is >1 PolicyEndpoint? We still have not experienced this issue since I changed the NetworkPolicy rule such that there is only a single PolicyEndpoint.
What happened:
We have an application that is failing readiness and liveness probes because the traffic is being denied by the Network Policy agent. We've seen this across multiple versions, including v1.1.0-eksbuild.1 and v1.1.2-eksbuild.1. I was able to see that the network traffic was being denied in /var/log/aws-routed-eni/network-policy-agent.log. After some period of time, the traffic is accepted again and the application recovers.

What stuck out to me is that there are multiple PolicyEndpoints created. Our NetworkPolicy looks something like the sketch below; think of use cases where all pods need to reach a core service. This results in multiple PEs:
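A minimal sketch of that kind of policy (the name, namespace, labels, and port below are placeholders, not our actual manifest):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-all-pods-to-core-service    # placeholder name
      namespace: core                          # placeholder namespace
    spec:
      podSelector:
        matchLabels:
          app: core-service                    # placeholder label
      policyTypes:
        - Ingress
      ingress:
        - from:
            # An empty namespaceSelector matches every namespace, so every pod
            # in the cluster ends up enumerated across the PolicyEndpoints
            # generated for this policy.
            - namespaceSelector: {}
          ports:
            - protocol: TCP
              port: 8080                       # placeholder port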
I tested that by changing namespaceSelector: {} to a rule like ipBlock.cidr: 0.0.0.0/0; the multiple PEs are removed and a single PE is created, since every Pod in the cluster no longer needs to be enumerated in .spec.ingress. We haven't seen a single probe failure in the week since changing the configuration to remove the multiple PEs, compared to literally hundreds of failures over the preceding couple of weeks.
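For reference, a sketch of the ingress section after the workaround, reusing the placeholders from the sketch above; the CIDR-based rule avoids selecting every pod:

      ingress:
        - from:
            # ipBlock-based rule: the agent resolves this to a single CIDR
            # instead of enumerating every pod, so only one PolicyEndpoint is
            # created for the policy.
            - ipBlock:
                cidr: 0.0.0.0/0
          ports:
            - protocol: TCP
              port: 8080                       # placeholder port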
It's also worth noting that this is an intermittent issue. The pattern we see is that the probes fail, the container is restarted, and then the service recovers. We'll see this anywhere from 1-5 times a day. Interestingly, we see this issue on a few of our clusters with ~2000 pods but a relatively low pod churn rate. We never see container restarts on our cluster with ~3000 pods, which has a higher churn rate due to heavy usage of CronJobs. I can see the Received a new reconcile request log line happening far more frequently in /var/log/aws-routed-eni/network-policy-agent.log on the cluster that's not experiencing this issue. This may still mean that the bug is occurring on that cluster as well, but the next reconciliation happens faster than the time it takes for the liveness probes to fail (~30s).

Attach logs
Logs were sent.
What you expected to happen:
Liveness / readiness probe traffic is not denied.
How to reproduce it (as minimally and precisely as possible):
- Readiness / liveness probes configured as: http-get http://some-endpoint delay=0s timeout=3s period=5s #success=1 #failure=6 (a sketch of an equivalent probe spec is below)
- A NetworkPolicy using namespaceSelector: {} for ingress rules
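A sketch of a probe spec equivalent to the parameters above (path and port are placeholders); failureThreshold=6 at periodSeconds=5 is roughly the ~30s window mentioned in the description:

      readinessProbe:                # liveness probe is configured similarly
        httpGet:
          path: /healthz             # placeholder for the real endpoint
          port: 8080                 # placeholder port
        initialDelaySeconds: 0
        timeoutSeconds: 3
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 6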
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- CNI version: v1.18.2-eksbuild.1
- Network Policy Agent version: v1.1.2-eksbuild.1
- OS (e.g. cat /etc/os-release):
- Kernel (e.g. uname -a):