aws / aws-network-policy-agent


Experiencing issues with long-lived connections being dropped #318

Open charlierm opened 1 month ago

charlierm commented 1 month ago

What happened:

This could be a dupe of #175 or #100. We're seeing issues with long-lived connections being dropped (return traffic not being allowed). The current example is the Grafana Operator calling the Kubernetes API server. We're using the latest version of the VPC CNI with Bottlerocket nodes. We see the operator's watches failing with errors like:

W1010 10:33:07.550838       1 reflector.go:484] k8s.io/client-go@v0.31.1/tools/cache/reflector.go:243: watch of *v1beta1.GrafanaNotificationPolicy ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

As well as DENY verdicts in the network policy agent logs:

2024-10-10 11:32:51.725 Node: ip-100-65-32-116.eu-west-1.compute.internal;SIP: 192.168.0.1;SPORT: 443;DIP: 100.65.131.18;DPORT: 44378;PROTOCOL: TCP;PolicyVerdict: DENY
2024-10-10 11:32:38.153 Node: ip-100-65-32-116.eu-west-1.compute.internal;SIP: 192.168.0.1;SPORT: 443;DIP: 100.65.131.18;DPORT: 44378;PROTOCOL: TCP;PolicyVerdict: DENY
The NetworkPolicy applied to the operator is:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grafana-operator
  namespace: platform-system
spec:
  egress:
  - ports:
    - port: 443
      protocol: TCP
    to:
    - ipBlock:
        cidr: 192.168.0.1/32
  - ports:
    - port: 443
      protocol: TCP
    to:
    - ipBlock:
        cidr: 0.0.0.0/0
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: grafana-k8s-monitoring
          app.kubernetes.io/name: alloy
    ports:
    - port: 9100
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/name: grafana-operator
  policyTypes:
  - Egress
  - Ingress

What you expected to happen: I expect return traffic to be allowed. If I add an explicit rule to the NetworkPolicy, it starts working. Also worth noting that this happens intermittently.
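
For illustration, an explicit ingress rule covering the denied return traffic would look something like this. This is only a sketch reconstructed from the DENY log entries above (API server 192.168.0.1:443 replying to the pod's ephemeral port); the policy name is illustrative and this is not necessarily the exact rule we applied:

# Sketch only: allows the return traffic seen being denied above.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grafana-operator-allow-apiserver-return
  namespace: platform-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: grafana-operator
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.0.1/32

With working reverse-path tracking, a rule like this should not be needed, since the connection is initiated by the operator's own egress to port 443.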

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

orsenthil commented 1 day ago

I expect return traffic to be allowed. If I add an explicit rule to the NetworkPolicy, it starts working.

This is a good workaround and solution for this issue.

I assume a new pod had come up at the destination, and the return traffic was denied due to a reconciliation-time issue in standard mode.

Have you considered moving to strict mode?
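
For reference, a minimal sketch of enabling strict mode, assuming the NETWORK_POLICY_ENFORCING_MODE environment variable described in the VPC CNI documentation (please verify the exact setting against your CNI / add-on version before applying):

# Sketch only: strategic-merge patch for the kube-system/aws-node DaemonSet
# setting NETWORK_POLICY_ENFORCING_MODE=strict on the aws-node container.
# Apply with something like:
#   kubectl patch ds aws-node -n kube-system --patch-file strict-mode-patch.yaml
spec:
  template:
    spec:
      containers:
      - name: aws-node
        env:
        - name: NETWORK_POLICY_ENFORCING_MODE
          value: strict

If the CNI is managed as an EKS add-on, the equivalent setting should go through the add-on configuration so it isn't overwritten. As I understand it, in strict mode a pod selected by a NetworkPolicy starts with a default-deny posture until its policies are reconciled, which avoids the window in standard mode where a connection is established before the policy is programmed and its return traffic is then dropped.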