617m4rc opened this issue 1 month ago
Hi @617m4rc, does this happen intermittently and get resolved without taking any action? How did the above issue get resolved for you?
Could you try the latest rc image in your cluster and see if you run into this issue? We have a possible fix for this issue in that rc image. You can update the image tag for network-policy-agent in your cluster to v1.1.3-rc1 and see if you hit this issue again.
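For reference, updating the tag amounts to editing the network policy agent container in the kube-system/aws-node DaemonSet. A sketch is below; the ECR account ID and region in the image URI vary per cluster, and the container name assumes the standard VPC CNI manifest:

```yaml
# Excerpt of the kube-system/aws-node DaemonSet (sketch): only the image tag
# of the network policy agent container changes. The ECR account ID and
# region below are examples and differ per cluster/region.
containers:
  - name: aws-eks-nodeagent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.3-rc1
```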
Hi @jaydeokar,
Version v1.1.3-rc1 shows the same behavior. In our experience, the affected pods do not recover without intervention. We have implemented a retry mechanism in our workload that recreates affected pods; in many cases, a second or third attempt works without problems.
@617m4rc - does it recover eventually, as in the DENY goes to ACCEPT, or did you have a workaround for this?

> a retry mechanism in our workload that recreates affected pods. In many cases, a second or third attempt works without problems.

Does this mean that even after a new pod gets a new IP, you can still see this?
Also, are you on strict mode or standard mode of network policy enforcement?
Could you send us the node logs from the run with the rc image where you hit this issue at k8s-awscni-triage@amazon.com. Also, please share the network policy that is attached to the pods.

> In many cases, a second or third attempt works without problems.

You mean recreating the pods works? Do you see this issue when the pod is long-running, or only when the pod has just launched?
There are a lot of other issues with network policies here:
https://github.com/aws/aws-network-policy-agent/issues/288
https://github.com/aws/aws-network-policy-agent/issues/236
https://github.com/aws/aws-network-policy-agent/issues/73
It would be nice if all of them could be fixed.
@albertschwarzkopf - Can you please verify with the latest released image - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.5? If you run into any of the issues please let us know.
@jayanthvn Thanks for the info. I have updated the EKS add-on and will watch it over the next few days.
We have updated to v1.18.5 and the problem remains. We also tried implementing an init container with a 2-second delay, as proposed in https://github.com/aws/aws-network-policy-agent/issues/288#issuecomment-2389704801, but that only partially mitigates the problem.
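For anyone else trying the same mitigation, a minimal sketch of such an init container (image choice and delay value are just what we picked, not a recommendation):

```yaml
# Init container that delays workload startup so the network policy agent
# has time to attach its eBPF probes before the application starts opening
# connections. Image and sleep duration are illustrative.
initContainers:
  - name: wait-for-network-policy
    image: public.ecr.aws/docker/library/busybox:stable
    command: ["sh", "-c", "sleep 2"]
```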
@jayanthvn I still see sporadic disconnections, especially when pods are restarted (e.g. during scaling operations).
Can you check to confirm this behavior only happens when pods are starting? Is the cluster in Standard or Strict mode?
Also, does a small init wait help (for validation purposes)?
@haouc No, unfortunately I cannot confirm that it only happens on restarts, but I have observed it several times when restarting a pod. As I said, it happens sporadically.
I have set "ANNOTATE_POD_IP": "true" in the add-on, but we do not use an init wait step, and we are using "Standard" mode. I think a feature like network policies should work without such workarounds.
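For reference, this is roughly how the setting looks in the vpc-cni add-on configuration values (a sketch, assuming the standard add-on schema where aws-node environment variables are set under env):

```yaml
# vpc-cni add-on configuration values (sketch). ANNOTATE_POD_IP=true makes
# the CNI annotate each pod with its IP, so the network policy agent can
# read the annotation instead of waiting for the pod IP to be populated on
# the API server object.
env:
  ANNOTATE_POD_IP: "true"
```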
What happened:
Network policy agent sporadically denies network traffic initiated by our workload even though there are network policies in place that explicitly allow such traffic. Denied traffic includes access to DNS, but also access to K8s services in the same namespace.
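For illustration, a hypothetical policy of the shape we have in place (all names and selectors below are made up, not our actual manifests): it allows DNS egress to kube-system and all egress within the namespace, yet matching traffic is still sporadically denied.

```yaml
# Hypothetical policy of the shape described (names are illustrative):
# allows DNS egress to kube-system and all egress within the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-namespace
  namespace: example
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
    - to:
        - podSelector: {}
```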
Attach logs
Full logs can be provided if required.
network-policy-agent.log:
ebpf-sdk.log:
journalctl.log:
What you expected to happen:
Network connectivity is NOT denied.
How to reproduce it (as minimally and precisely as possible):
Unclear
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):