aws / aws-network-policy-agent

Network traffic sporadically denied despite valid network policies #307

Open · 617m4rc opened this issue 1 month ago

617m4rc commented 1 month ago

What happened:

The network policy agent sporadically denies network traffic initiated by our workload even though network policies are in place that explicitly allow that traffic. Denied traffic includes DNS lookups as well as access to Kubernetes services in the same namespace.
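
To illustrate the kind of rules in place, here is a simplified sketch of an egress policy that should cover the denied flows; selectors and names are placeholders, not our actual manifest, and the ports 53/UDP and 14220/TCP are taken from the flow logs below:

```yaml
# Simplified sketch, not the actual policy: selectors and names are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-namespace-services
  namespace: es4                      # namespace as seen in the kubelet logs
spec:
  podSelector: {}                     # placeholder; the real policy selects specific workload pods
  policyTypes:
    - Egress
  egress:
    - to:                             # cluster DNS (53/UDP flows in the logs)
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                             # services in the same namespace (e.g. 14220/TCP flows)
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 14220
```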

Attach logs: Full logs can be provided if required.

network-policy-agent.log:

...
{"level":"info","ts":"2024-09-23T12:36:50.288Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:725","msg":"ID of map to update: ","ID: ":210}
{"level":"info","ts":"2024-09-23T12:36:50.289Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:729","msg":"BPF map update failed","error: ":"unable to update map: invalid argument"}
{"level":"info","ts":"2024-09-23T12:36:50.289Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:686","msg":"Egress Map update failed: ","error: ":"unable to update map: invalid argument"}
{"level":"info","ts":"2024-09-23T12:36:53.562Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.141.167","Src Port":39197,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"ACCEPT"}
{"level":"info","ts":"2024-09-23T12:36:53.562Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
{"level":"info","ts":"2024-09-23T12:36:53.604Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.141.167","Src Port":43088,"Dest IP":"172.20.2.72","Dest Port":14220,"Proto":"TCP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:36:53.604Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
{"level":"info","ts":"2024-09-23T12:36:59.577Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.141.167","Src Port":43096,"Dest IP":"172.20.2.72","Dest Port":14220,"Proto":"TCP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:36:59.577Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
{"level":"info","ts":"2024-09-23T12:37:05.579Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.141.167","Src Port":42326,"Dest IP":"172.20.2.72","Dest Port":14220,"Proto":"TCP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:37:05.579Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
{"level":"info","ts":"2024-09-23T12:37:11.582Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.141.167","Src Port":44582,"Dest IP":"172.20.2.72","Dest Port":14220,"Proto":"TCP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:37:11.582Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
{"level":"info","ts":"2024-09-23T12:37:17.585Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.141.167","Src Port":44584,"Dest IP":"172.20.2.72","Dest Port":14220,"Proto":"TCP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:37:17.585Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
...
{"level":"info","ts":"2024-09-23T12:41:52.195Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.143.101","Src Port":46729,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:41:52.195Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
{"level":"info","ts":"2024-09-23T12:41:54.697Z","logger":"ebpf-client","caller":"events/events.go:193","msg":"Flow Info:  ","Src IP":"10.0.143.101","Src Port":46729,"Dest IP":"172.20.0.10","Dest Port":53,"Proto":"UDP","Verdict":"DENY"}
{"level":"info","ts":"2024-09-23T12:41:54.698Z","logger":"ebpf-client","caller":"events/events.go:126","msg":"Sending logs to CW"}
...

ebpf-sdk.log:

...
{"level":"error","ts":"2024-09-23T12:36:49.239Z","caller":"maps/loader.go:284","msg":"unable to create/update map entry and ret -1 and err invalid argument"}
{"level":"info","ts":"2024-09-23T12:36:49.239Z","caller":"maps/loader.go:484","msg":"One of the element update failed hence returning from bulk update"}
{"level":"error","ts":"2024-09-23T12:36:49.239Z","caller":"ebpf/bpf_client.go:726","msg":"refresh map failed: during update unable to update map: invalid argument"}
...

journalctl.log:

...
Sep 23 12:36:47.190274 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.190206    1652 topology_manager.go:215] "Topology Admit Handler" podUID="963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd" podNamespace="xxx" podName="xxx"
Sep 23 12:36:47.190274 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:47.190255    1652 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="5e57a22e-aea1-4474-9e78-5283a1de37b8" containerName="spark-kubernetes-driver"
Sep 23 12:36:47.190274 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:47.190262    1652 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="a24d92cf-8612-4e94-a812-9400e9aca96e" containerName="spark-kubernetes-executor"
Sep 23 12:36:47.190696 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.190280    1652 memory_manager.go:354] "RemoveStaleState removing state" podUID="a24d92cf-8612-4e94-a812-9400e9aca96e" containerName="spark-kubernetes-executor"
Sep 23 12:36:47.190696 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.190286    1652 memory_manager.go:354] "RemoveStaleState removing state" podUID="5e57a22e-aea1-4474-9e78-5283a1de37b8" containerName="spark-kubernetes-driver"
Sep 23 12:36:47.194309 ip-xx-xx-xx-xx.eu-north-1.compute.internal systemd[1]: Created slice libcontainer container kubepods-burstable-pod963bc2a7_4fbd_4fb9_9af9_818ac4abf5dd.slice.
Sep 23 12:36:47.275505 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.275484    1652 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"spark-local-dir-1\" (UniqueName: \"kubernetes.io/empty-dir/963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd-spark-local-dir-1\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") " pod="es4/xxx"
Sep 23 12:36:47.275581 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.275512    1652 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-ce6c2f7d-53dc-4a4e-b0b8-531e86146c63\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-xxx::fsap-xxx\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") " pod="es4/xxx"
Sep 23 12:36:47.275581 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.275533    1652 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-98853799-8689-47b1-a434-561e0b325181\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-xxx::fsap-xxx\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") " pod="es4/xxx"
Sep 23 12:36:47.275581 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.275552    1652 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"spark-conf-volume-driver\" (UniqueName: \"kubernetes.io/configmap/963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd-spark-conf-volume-driver\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") " pod="es4/xxx"
Sep 23 12:36:47.275581 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.275565    1652 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-b8x46\" (UniqueName: \"kubernetes.io/projected/963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd-kube-api-access-b8x46\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") " pod="es4/xxx"
Sep 23 12:36:47.275650 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.275585    1652 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9d6a7fa2-0381-4ac2-8384-8af9c2c2e093\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-xxx::fsap-xxx\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") " pod="es4/xxx"
Sep 23 12:36:47.376151 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:47.376129    1652 configmap.go:199] Couldn't get configMap es4/spark-drv-cbb21d921ee26617-conf-map: configmap "spark-drv-cbb21d921ee26617-conf-map" not found
Sep 23 12:36:47.376266 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:47.376254    1652 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/configmap/963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd-spark-conf-volume-driver podName:963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd nodeName:}" failed. No retries permitted until 2024-09-23 12:36:47.876238946 +0000 UTC m=+1708.855339997 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "spark-conf-volume-driver" (UniqueName: "kubernetes.io/configmap/963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd-spark-conf-volume-driver") pod "xxx" (UID: "963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd") : configmap "spark-drv-cbb21d921ee26617-conf-map" not found
Sep 23 12:36:47.376962 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.376941    1652 csi_attacher.go:380] kubernetes.io/csi: attacher.MountDevice STAGE_UNSTAGE_VOLUME capability not set. Skipping MountDevice...
Sep 23 12:36:47.376962 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.376950    1652 csi_attacher.go:380] kubernetes.io/csi: attacher.MountDevice STAGE_UNSTAGE_VOLUME capability not set. Skipping MountDevice...
Sep 23 12:36:47.377042 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.376974    1652 operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"pvc-9d6a7fa2-0381-4ac2-8384-8af9c2c2e093\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-xxx::fsap-xxx\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/efs.csi.aws.com/ae924cc015ee433d75a92309dcb9847685e112b971839d52d9b32171790ca4f4/globalmount\"" pod="es4/xxx"
Sep 23 12:36:47.377042 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.377004    1652 csi_attacher.go:380] kubernetes.io/csi: attacher.MountDevice STAGE_UNSTAGE_VOLUME capability not set. Skipping MountDevice...
Sep 23 12:36:47.377096 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.377032    1652 operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"pvc-ce6c2f7d-53dc-4a4e-b0b8-531e86146c63\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-xxx::fsap-xxx\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/efs.csi.aws.com/8405127b1373651bed7a9bb1be7e070a81b25f0127c1a0bc2a76e09ce3d35360/globalmount\"" pod="es4/xxx"
Sep 23 12:36:47.377096 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:47.376974    1652 operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"pvc-98853799-8689-47b1-a434-561e0b325181\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-xxx::fsap-xxx\") pod \"xxx\" (UID: \"963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/efs.csi.aws.com/91b605ef485e66cc2d16afb6256854a2a7bee06e8ec3c706d89732200f54c473/globalmount\"" pod="es4/xxx"
Sep 23 12:36:48.096795 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:48.096759    1652 kubelet_pods.go:513] "Hostname for pod was too long, truncated it" podName="xxx" hostnameMaxLen=63 truncatedHostname="xxx"
Sep 23 12:36:48.097185 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.097152441Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:xxx,Uid:963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd,Namespace:es4,Attempt:0,}"
Sep 23 12:36:48.225888 ip-xx-xx-xx-xx.eu-north-1.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enie2ddee3c872: link becomes ready
Sep 23 12:36:48.226098 ip-xx-xx-xx-xx.eu-north-1.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Sep 23 12:36:48.225967 ip-xx-xx-xx-xx.eu-north-1.compute.internal (udev-worker)[46433]: Network interface NamePolicy= disabled on kernel command line.
Sep 23 12:36:48.225990 ip-xx-xx-xx-xx.eu-north-1.compute.internal systemd-networkd[1256]: enie2ddee3c872: Link UP
Sep 23 12:36:48.226220 ip-xx-xx-xx-xx.eu-north-1.compute.internal systemd-networkd[1256]: enie2ddee3c872: Gained carrier
Sep 23 12:36:48.227602 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: 2024-09-23 12:36:48.11027357 +0000 UTC m=+0.003531707 write error: can't make directories for new logfile: mkdir /host: read-only file system
Sep 23 12:36:48.227602 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: 2024-09-23 12:36:48.110343501 +0000 UTC m=+0.003601624 write error: can't make directories for new logfile: mkdir /host: read-only file system
Sep 23 12:36:48.244671 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.244596485Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Sep 23 12:36:48.244671 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.244639160Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Sep 23 12:36:48.244671 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.244652634Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Sep 23 12:36:48.244865 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.244825414Z" level=info msg="starting signal loop" namespace=k8s.io path=/run/containerd/io.containerd.runtime.v2.task/k8s.io/85764916af3483ba8545dff4e1bd4790a944551eedcd63433d4ed20e853782db pid=46568 runtime=io.containerd.runc.v2
Sep 23 12:36:48.325622 ip-xx-xx-xx-xx.eu-north-1.compute.internal systemd[1]: Started libcontainer container 85764916af3483ba8545dff4e1bd4790a944551eedcd63433d4ed20e853782db.
Sep 23 12:36:48.365651 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.365582463Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:xxx,Uid:963bc2a7-4fbd-4fb9-9af9-818ac4abf5dd,Namespace:es4,Attempt:0,} returns sandbox id \"85764916af3483ba8545dff4e1bd4790a944551eedcd63433d4ed20e853782db\""
Sep 23 12:36:48.366170 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:48.366142    1652 kubelet_pods.go:513] "Hostname for pod was too long, truncated it" podName="xxx" hostnameMaxLen=63 truncatedHostname="xxx"
Sep 23 12:36:48.366718 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:48.366694    1652 kubelet_pods.go:513] "Hostname for pod was too long, truncated it" podName="xxx" hostnameMaxLen=63 truncatedHostname="xxx"
Sep 23 12:36:48.368961 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.368939503Z" level=info msg="CreateContainer within sandbox \"85764916af3483ba8545dff4e1bd4790a944551eedcd63433d4ed20e853782db\" for container &ContainerMetadata{Name:spark-kubernetes-driver,Attempt:0,}"
Sep 23 12:36:48.378089 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.378056875Z" level=info msg="CreateContainer within sandbox \"85764916af3483ba8545dff4e1bd4790a944551eedcd63433d4ed20e853782db\" for &ContainerMetadata{Name:spark-kubernetes-driver,Attempt:0,} returns container id \"103b7cd75ed53f237af9e05874674af580d321eb2d768ca7362ed812a2425aad\""
Sep 23 12:36:48.378433 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.378410736Z" level=info msg="StartContainer for \"103b7cd75ed53f237af9e05874674af580d321eb2d768ca7362ed812a2425aad\""
Sep 23 12:36:48.399959 ip-xx-xx-xx-xx.eu-north-1.compute.internal systemd[1]: Started libcontainer container 103b7cd75ed53f237af9e05874674af580d321eb2d768ca7362ed812a2425aad.
Sep 23 12:36:48.418502 ip-xx-xx-xx-xx.eu-north-1.compute.internal containerd[1563]: time="2024-09-23T12:36:48.418470422Z" level=info msg="StartContainer for \"103b7cd75ed53f237af9e05874674af580d321eb2d768ca7362ed812a2425aad\" returns successfully"
Sep 23 12:36:49.135580 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:49.135550    1652 kubelet_pods.go:513] "Hostname for pod was too long, truncated it" podName="xxx" hostnameMaxLen=63 truncatedHostname="xxx"
Sep 23 12:36:49.165565 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:36:49.165526    1652 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="es4/xxx" podStartSLOduration=2.165515378 podStartE2EDuration="2.165515378s" podCreationTimestamp="2024-09-23 12:36:47 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-09-23 12:36:49.162698174 +0000 UTC m=+1710.141799217" watchObservedRunningTime="2024-09-23 12:36:49.165515378 +0000 UTC m=+1710.144616414"
Sep 23 12:36:49.214250 ip-xx-xx-xx-xx.eu-north-1.compute.internal kernel: enie2ddee3c872: Caught tx_queue_len zero misconfig
Sep 23 12:36:49.355422 ip-xx-xx-xx-xx.eu-north-1.compute.internal systemd-networkd[1256]: enie2ddee3c872: Gained IPv6LL
Sep 23 12:36:50.137138 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: E0923 12:36:50.137105    1652 kubelet_pods.go:513] "Hostname for pod was too long, truncated it" podName="xxx" hostnameMaxLen=63 truncatedHostname="xxx"
Sep 23 12:37:20.510457 ip-xx-xx-xx-xx.eu-north-1.compute.internal kubelet[1652]: I0923 12:37:20.510410    1652 scope.go:117] "RemoveContainer" containerID="f47a6e252dc472727849954286c87103a1f448c2fed725b8e22c7fc3429669d2"
...

What you expected to happen:

Network traffic that is explicitly allowed by the network policies is not denied.

How to reproduce it (as minimally and precisely as possible):

Unclear

Anything else we need to know?:

Environment:

jaydeokar commented 1 month ago

Hi @617m4rc, does this happen intermittently and get resolved without taking any action? How did the above issue get resolved for you?

Could you try the latest rc image in your cluster and see if you run into this issue? We have a possible fix for it in that rc image. You can update the image tag for network-policy-agent in your cluster to v1.1.3-rc1 and see if you hit the issue again.
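
For reference, one way to try the rc tag is to point the node agent container in the aws-node DaemonSet at the rc image. A minimal sketch of a strategic-merge patch follows; the container name aws-eks-nodeagent and the registry path are assumptions to verify against your cluster:

```yaml
# Sketch only: apply with
#   kubectl -n kube-system patch daemonset aws-node --patch-file nodeagent-rc.yaml
# Container name and registry path are assumptions; check your aws-node spec first.
spec:
  template:
    spec:
      containers:
        - name: aws-eks-nodeagent
          image: <your-regional-ecr-registry>/amazon/aws-network-policy-agent:v1.1.3-rc1
```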

617m4rc commented 1 month ago

Hi @jaydeokar,

Version v1.1.3-rc1 shows the same behavior. In our experience, the affected pods do not recover without intervention. We have implemented a retry mechanism in our workload that recreates affected pods. In many cases, a second or third attempt works without problems.

orsenthil commented 1 month ago

@617m4rc - does it recover eventually, as in the DENY changes to ACCEPT, or do you have a workaround for this?

a retry mechanism in our workload that recreates affected pods. In many cases, a second or third attempt works without problems.

Does this mean that after a new pod gets a new IP, you can still see this?

Also, are you on strict mode or standard mode of network policy enforcement?

jaydeokar commented 1 month ago

Could you send us the node logs from the rc image run where you hit this issue to k8s-awscni-triage@amazon.com? Also, please share the network policy that is attached to the pods.

In many cases, a second or third attempt works without problems.

You mean recreating the pods works? Do you see this issue when the pod is long-running, or only when the pod has just launched?

albertschwarzkopf commented 1 month ago

There are a number of other issues with network policies here:

https://github.com/aws/aws-network-policy-agent/issues/288
https://github.com/aws/aws-network-policy-agent/issues/236
https://github.com/aws/aws-network-policy-agent/issues/73

It would be nice if all of them could be fixed.

jayanthvn commented 1 month ago

@albertschwarzkopf - Can you please verify with the latest released image, https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.5? If you run into any of the issues, please let us know.

albertschwarzkopf commented 1 month ago

@jayanthvn Thanks for the info. I have updated the EKS add-on and will watch it over the next few days.

617m4rc commented 1 month ago

We have updated to v1.18.5 and the problem remains. We also tried implementing an init container with a 2-second delay as proposed in https://github.com/aws/aws-network-policy-agent/issues/288#issuecomment-2389704801, but that only partially mitigates the problem.
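
For reference, that workaround boils down to delaying the main container's start so the node agent has time to attach and populate its policy maps. A minimal sketch, with the image and sleep duration purely illustrative:

```yaml
# Illustrative sketch of the init-container delay workaround discussed in issue #288.
spec:
  initContainers:
    - name: wait-for-network-policy
      image: public.ecr.aws/docker/library/busybox:stable
      command: ["sh", "-c", "sleep 2"]   # give the node agent time to set up policy maps
```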

albertschwarzkopf commented 1 month ago

@jayanthvn I still see sporadic disconnections, especially when pods are restarted (e.g., during scaling operations).

haouc commented 1 month ago

@jayanthvn I still see sporadic disconnections, especially when pods are restarted (e.g., during scaling operations).

Can you check and confirm whether this behavior only happens when pods are starting? Is the cluster on Standard or Strict mode?

Also, does a small init wait help (for validation purposes)?

albertschwarzkopf commented 1 month ago

@haouc Unfortunately, no, I cannot confirm that it happens only on restarts, but I have observed it several times by restarting a pod. As I said, it happens sporadically. I have set "ANNOTATE_POD_IP": "true" in the add-on, but we do not use an init wait step, and we are using "Standard" mode. I still think a feature like network policies should work without such workarounds.
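
For context, a sketch of how those add-on settings might be expressed in the vpc-cni add-on configuration values; the field names are assumptions and should be checked against the schema returned by `aws eks describe-addon-configuration`:

```yaml
# Sketch only: field names are assumptions; verify against the schema from
#   aws eks describe-addon-configuration --addon-name vpc-cni --addon-version <version>
enableNetworkPolicy: "true"
env:
  ANNOTATE_POD_IP: "true"
  # Enforcement mode: "standard" (what we run, per the comment above) or "strict".
  NETWORK_POLICY_ENFORCING_MODE: "standard"
```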