aws / aws-network-policy-agent


Apache Zookeeper: Periodic connection loss #160

Closed: xashr closed this issue 9 months ago

xashr commented 9 months ago

What happened: After migrating from Calico to the Amazon VPC CNI addon, we observed problems with Strimzi Kafka, more precisely with Apache Zookeeper.

Strimzi installs a network policy by default to allow communication between Zookeeper pods (and other Strimzi-related pods). In a cluster with no network policies besides this default Zookeeper policy, the following happens in a namespace with 3 Zookeeper replicas: established connections between the Zookeeper pods are dropped periodically.

Attach logs: In the Zookeeper logs, the connection loss usually shows up like this:

[myid:2] - ERROR [LearnerHandler-/10.0.112.14:60242:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
  java.net.SocketTimeoutException: Read timed out  
...
[myid:2] - WARN  [LearnerHandler-/10.0.112.14:60242:LearnerHandler@737] - ******* GOODBYE /10.0.112.14:60242 ******** 

or

[myid:2] - ERROR [LearnerHandler-/10.0.114.144:42736:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
  java.io.EOFException
...
[myid:2] - WARN  [LearnerHandler-/10.0.114.144:42736:LearnerHandler@737] - ******* GOODBYE /10.0.114.144:42736 ********

What you expected to happen: Established connections are not dropped periodically.

How to reproduce it (as minimally and precisely as possible): We were able to reproduce it easily by installing Zookeeper from the Bitnami chart and applying a Strimzi-like network policy:

Install Zookeeper with 3 replicas:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install zookeepertest bitnami/zookeeper --version 10.2.5 --set replicaCount=3 --set logLevel=INFO
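
As a quick sanity check before applying the policy, you can confirm that all three replicas are running (assuming the default Bitnami labels for the zookeepertest release):

# list the Zookeeper pods created by the zookeepertest release
kubectl get pods -l app.kubernetes.io/instance=zookeepertest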

Install a "Strimzi-like" network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-zookeepertest
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2181
      protocol: TCP
  - ports:
    - port: 9404
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}
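
Assuming the manifest above is saved as netpol-zookeepertest.yaml (a file name chosen here for illustration), it can be applied and inspected with:

# apply the Strimzi-like policy and confirm which pods and ports it selects
kubectl apply -f netpol-zookeepertest.yaml
kubectl describe networkpolicy netpol-zookeepertest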

Observe Zookeeper logs.
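
One way to watch for the drops described above, assuming the StatefulSet pods end up named zookeepertest-0/1/2 (the exact names depend on the chart's naming template):

# follow one pod's logs and surface the connection-loss markers
kubectl logs -f zookeepertest-0 | grep -E 'GOODBYE|SocketTimeoutException|EOFException'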

Anything else we need to know?: Something interesting we observed: adding an extra ingress rule that explicitly allows the TCP port range 40000-65000 between the Zookeeper pods works around the problem.

Wild guess why this workaround works: The range 40000-65000 is the range of the source ports (as you can see in the logs above: GOODBYE /10.0.112.14:60242). Maybe there is a bug in the policy agent that causes a loss of state after x minutes. After the state loss, traffic to the source port 60242 is no longer known/accepted. With the workaround policy, however, the port range is explicitly allowed.
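
For illustration, a minimal sketch of such a workaround policy, assuming the cluster and CNI support endPort ranges; the policy name here is made up:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-zookeepertest-ephemeral
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    # explicitly allow the ephemeral source-port range mentioned above
    - port: 40000
      endPort: 65000
      protocol: TCP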

Environment:

jayanthvn commented 9 months ago

@xashr - This is similar to https://github.com/aws/aws-network-policy-agent/issues/144. We have a fix for this and I can provide you a release candidate image if you are willing to try it out.

xashr commented 9 months ago

@jayanthvn - Is there an RC newer than v1.0.7-rc1? It sounds like #144 is not fixed yet, according to Rez0k.

jayanthvn commented 9 months ago

Would you be able to try this image:

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.
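
If it helps, one way to roll out the release candidate is to swap the image on the network policy agent container in the aws-node DaemonSet (assuming the container is named aws-eks-nodeagent, which may differ between addon versions):

# point the node agent container at the RC image; the DaemonSet will roll the pods
kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3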

xashr commented 9 months ago

@jayanthvn We are staying with Calico for now, but I ran a short test with the rc3 image in a separate cluster. The issue seems to be solved with that image. Thanks!

jayanthvn commented 8 months ago

Thanks for trying out the image. Please feel free to reach out if you run into any issues, and we will be happy to help.