Closed: @Deofex closed this issue 5 months ago.
@Deofex - Similar issue was fixed. Can you please try v1.0.8-rc3?
I'll test the version you suggested. It might take some time to determine the outcome; given the sporadic nature of the issue, confirmation could take anywhere from a few hours to a week. If there's a strong likelihood that this bug is fixed in v1.0.8-rc3, you can go ahead and close this report.
Thanks for your assistance!
Thanks @Deofex. Please keep us updated. The v1.0.8 release is available: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
Hi @jayanthvn, we have been experiencing similar behavior with the Flux service installed in our cluster. Health checks (liveness and readiness) are failing with timeout errors, causing the controllers to restart constantly. These errors are intermittent and last for a few seconds until the service restarts.
Environment:
Kubernetes version: v1.28
CNI Version: v1.15.3-eksbuild.1
Network Policy Agent Version: v1.0.5-eksbuild.1
These are some of the logs I can see:
network-policy-agent.log
{"level":"info","ts":"2024-02-20T21:50:15.420Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:413","msg":"BPF map update failed","error: ":"unable to update map: invalid argument"}
{"level":"info","ts":"2024-02-20T21:50:15.420Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:267","msg":"Ingress Map update failed: ","error: ":"unable to update map: invalid argument"}
{"level":"info","ts":"2024-02-20T21:50:15.420Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:267","msg":"Pod has an Egress hook attached. Update the corresponding map","progFD: ":42,"mapName: ":"egress_map"}
ebpf-sdk.log
{"level":"info","ts":"2024-02-20T21:52:15.257Z","caller":"ebpf/bpf_client.go:708","msg":"Check for stale entries and got 2 entries from BPF map"}
{"level":"info","ts":"2024-02-20T21:52:15.257Z","caller":"ebpf/bpf_client.go:708","msg":"Checking if key \u0000\u0000\u0000\ufffd\u0012fT is deltable"}
{"level":"info","ts":"2024-02-20T21:52:15.257Z","caller":"ebpf/bpf_client.go:708","msg":"Checking if key \u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000 is deltable"}
{"level":"info","ts":"2024-02-20T21:52:19.469Z","caller":"maps/loader.go:636","msg":"Got next map entry with fd : 0 and err errno 0"}
{"level":"info","ts":"2024-02-20T21:52:19.469Z","caller":"conntrack/conntrack_client.go:93","msg":"Got map entry with ret : 0 and err errno 0"}
{"level":"info","ts":"2024-02-20T21:52:19.469Z","caller":"conntrack/conntrack_client.go:115","msg":"Got next map entry with fd : 0 and err errno 0"}
{"level":"error","ts":"2024-02-20T21:52:19.469Z","caller":"conntrack/conntrack_client.go:93","msg":"unable to get map entry and ret -1 and err no such file or directory"}
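Since the failures are intermittent, filtering the structured log lines can help catch them as they happen. A minimal hypothetical Python sketch (field names, including the agent's quirky `"error: "` key with its trailing colon and space, are taken from the excerpts above; file handling is omitted):

```python
import json

# Hypothetical sketch: scan aws-network-policy-agent JSON log lines for
# map-update failures like the ones shown above.
def find_map_failures(lines):
    failures = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        msg = entry.get("msg", "")
        err = entry.get("error: ", "")  # key name matches the excerpts
        if "map update failed" in msg.lower() or "unable to update map" in err:
            failures.append((entry.get("ts"), msg, err))
    return failures

# Two sample lines modeled on the log excerpts above.
sample = [
    '{"level":"info","ts":"2024-02-20T21:50:15.420Z","msg":"BPF map update failed","error: ":"unable to update map: invalid argument"}',
    '{"level":"info","ts":"2024-02-20T21:50:15.420Z","msg":"Pod has an Egress hook attached. Update the corresponding map"}',
]
for ts, msg, err in find_map_failures(sample):
    print(ts, msg, err)
```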
Is it possible that this is related to the same issue you mentioned in the previous message?
We have implemented two network policies - a default policy and a specific one tailored for our service - across multiple pods. However, intermittently (ranging from a few hours to a week), these pods cease to respond altogether. Interestingly, removing and reapplying the network policies resolves the issue, restoring normal network traffic without further complications.
The default policy is configured as follows:
And the specific policy for our service is defined as:
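The actual manifests were not captured in this excerpt. Purely as an illustration of the shape described (all names, namespaces, labels, and ports below are hypothetical, not the reporter's actual configuration), a default-deny policy plus a service-specific allow might look like:

```yaml
# Illustrative sketch only - not the reporter's actual manifests.
# A typical namespace-wide default-deny ingress policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical name
  namespace: flux-system       # hypothetical namespace
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
    - Ingress
---
# A service-specific policy opening traffic to one controller:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-flux-controller  # hypothetical name
  namespace: flux-system
spec:
  podSelector:
    matchLabels:
      app: source-controller   # hypothetical label
  ingress:
    - from:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9090           # hypothetical port
  policyTypes:
    - Ingress
```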
During these incidents, the logs indicate a series of Conntrack cleanup operations followed by controller activities:
Subsequently:
It's worth noting that this issue affects only one of several services with similar network policy configurations. This particular service handles higher request volumes and undergoes frequent scaling operations.