Closed: @Deofex closed this issue 5 months ago.
@Deofex - Similar issue was fixed. Can you please try v1.0.8-rc3?
I'll test the version you suggested. It might take some time to determine the outcome; given the sporadic nature of the issue, confirmation could take anywhere from a few hours to a week. If there's a strong likelihood that this bug is fixed in v1.0.8-rc3, you can go ahead and close this report.
Thanks for your assistance!
Thanks @Deofex. Please keep us updated. The v1.0.8 release is available: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
Hi @jayanthvn, we have been experiencing similar behavior with the Flux service installed in our cluster. Health checks (liveness and readiness) are failing with timeout errors, causing the controllers to restart constantly. These errors are intermittent and last for a few seconds until the service restarts.
Environment:
Kubernetes version: v1.28
CNI Version: v1.15.3-eksbuild.1
Network Policy Agent Version: v1.0.5-eksbuild.1
These are some of the logs I can see:
network-policy-agent.log
{"level":"info","ts":"2024-02-20T21:50:15.420Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:413","msg":"BPF map update failed","error: ":"unable to update map: invalid argument"}
{"level":"info","ts":"2024-02-20T21:50:15.420Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:267","msg":"Ingress Map update failed: ","error: ":"unable to update map: invalid argument"}
{"level":"info","ts":"2024-02-20T21:50:15.420Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:267","msg":"Pod has an Egress hook attached. Update the corresponding map","progFD: ":42,"mapName: ":"egress_map"}
ebpf-sdk.log
{"level":"info","ts":"2024-02-20T21:52:15.257Z","caller":"ebpf/bpf_client.go:708","msg":"Check for stale entries and got 2 entries from BPF map"}
{"level":"info","ts":"2024-02-20T21:52:15.257Z","caller":"ebpf/bpf_client.go:708","msg":"Checking if key \u0000\u0000\u0000\ufffd\u0012fT is deltable"}
{"level":"info","ts":"2024-02-20T21:52:15.257Z","caller":"ebpf/bpf_client.go:708","msg":"Checking if key \u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000 is deltable"}
{"level":"info","ts":"2024-02-20T21:52:19.469Z","caller":"maps/loader.go:636","msg":"Got next map entry with fd : 0 and err errno 0"}
{"level":"info","ts":"2024-02-20T21:52:19.469Z","caller":"conntrack/conntrack_client.go:93","msg":"Got map entry with ret : 0 and err errno 0"}
{"level":"info","ts":"2024-02-20T21:52:19.469Z","caller":"conntrack/conntrack_client.go:115","msg":"Got next map entry with fd : 0 and err errno 0"}
{"level":"error","ts":"2024-02-20T21:52:19.469Z","caller":"conntrack/conntrack_client.go:93","msg":"unable to get map entry and ret -1 and err no such file or directory"}
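Since the failures are intermittent, filtering the structured log lines can help catch them as they happen. A minimal hypothetical Python sketch (field names, including the agent's quirky `"error: "` key with its trailing colon and space, are taken from the excerpts above; file handling is omitted):

```python
import json

# Hypothetical sketch: scan aws-network-policy-agent JSON log lines for
# map-update failures like the ones shown above.
def find_map_failures(lines):
    failures = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        msg = entry.get("msg", "")
        err = entry.get("error: ", "")  # key name matches the excerpts
        if "map update failed" in msg.lower() or "unable to update map" in err:
            failures.append((entry.get("ts"), msg, err))
    return failures

# Two sample lines modeled on the log excerpts above.
sample = [
    '{"level":"info","ts":"2024-02-20T21:50:15.420Z","msg":"BPF map update failed","error: ":"unable to update map: invalid argument"}',
    '{"level":"info","ts":"2024-02-20T21:50:15.420Z","msg":"Pod has an Egress hook attached. Update the corresponding map"}',
]
for ts, msg, err in find_map_failures(sample):
    print(ts, msg, err)
```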
Is it possible that this is related to the same issue you mentioned in the previous message?
We have implemented two network policies - a default policy and a specific one tailored for our service - across multiple pods. However, intermittently (ranging from a few hours to a week), these pods cease to respond altogether. Interestingly, removing and reapplying the network policies resolves the issue, restoring normal network traffic without further complications.
The default policy is configured as follows:
And the specific policy for our service is defined as:
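The actual manifests were not captured in this excerpt. Purely as an illustration of the shape described (all names, namespaces, labels, and ports below are hypothetical, not the reporter's actual configuration), a default-deny policy plus a service-specific allow might look like:

```yaml
# Illustrative sketch only - not the reporter's actual manifests.
# A typical namespace-wide default-deny ingress policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical name
  namespace: flux-system       # hypothetical namespace
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
    - Ingress
---
# A service-specific policy opening traffic to one controller:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-flux-controller  # hypothetical name
  namespace: flux-system
spec:
  podSelector:
    matchLabels:
      app: source-controller   # hypothetical label
  ingress:
    - from:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9090           # hypothetical port
  policyTypes:
    - Ingress
```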
During these incidents, the logs indicate a series of Conntrack cleanup operations followed by controller activities:
Subsequently:
It's worth noting that this issue affects only one of several services with similar network policy configurations. This particular service handles higher request volumes and undergoes frequent scaling operations.