xashr closed this issue 9 months ago
@xashr - This is similar to https://github.com/aws/aws-network-policy-agent/issues/144. We have a fix for this and I can provide you a release candidate image if you are willing to try it out.
@jayanthvn - Is there an RC newer than v1.0.7-rc1? According to Rez0k, it sounds like #144 is not fixed yet.
Would you be able to try this image?
<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3
Please make sure you replace the account number and region.
@jayanthvn We are staying with Calico for now, but I ran a short test with the rc3 image in a separate cluster. The issue seems to be solved with that image. Thanks!
Thanks for trying out the image. Please feel free to reach out if you are having issues and we will be happy to help.
What happened: After migrating from Calico to Amazon VPC CNI Addon, we observed problems with Strimzi Kafka, more precisely with Apache Zookeeper.
Strimzi installs a network policy by default to allow communication between Zookeeper pods (and other Strimzi-related pods). So in a cluster with no network policies besides this default Zookeeper policy, the following happens in a namespace with 3 Zookeeper replicas:
Attach logs: In the Zookeeper logs, the connection loss usually shows up like this:
or
What you expected to happen: Established connections do not get dropped (periodically).
How to reproduce it (as minimally and precisely as possible): We were able to make it easily reproducible by installing Zookeeper from Bitnami and applying a Strimzi-like network policy:
Install zookeeper with 3 replicas
Install "Strimzi-like" network policy:
Observe Zookeeper logs.
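To make the log observation step less manual, the disconnects can be counted per peer. This is only a minimal sketch: it assumes the disconnect lines contain the `GOODBYE /<ip>:<port>` fragment quoted in this report, and the names `find_goodbyes` and `count_by_peer` are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: a disconnect shows up as "GOODBYE /<ipv4>:<port>", as in the
# fragment quoted in this report. Lines in any other format are ignored.
GOODBYE_RE = re.compile(r"GOODBYE /(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def find_goodbyes(lines):
    """Return a list of (peer_ip, source_port) for every GOODBYE line."""
    hits = []
    for line in lines:
        match = GOODBYE_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


def count_by_peer(hits):
    """Count how many GOODBYE lines were seen per peer IP."""
    return Counter(ip for ip, _port in hits)


if __name__ == "__main__":
    # Reads log lines from stdin and prints one "ip<TAB>count" row per peer,
    # followed by the source ports that were seen.
    found = find_goodbyes(sys.stdin)
    for ip, count in count_by_peer(found).most_common():
        print(f"{ip}\t{count}")
    print("source ports:", sorted({port for _ip, port in found}))
```

It could be fed from `kubectl logs <zookeeper-pod>` through a pipe, with the script saved under any file name. The printed source ports help check whether the dropped connections all come from the same port range.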
Anything else we need to know?: Something interesting we observed:
Wild guess why this workaround works: The range 40000-65000 is the range of the source ports (as you can see in the logs above: GOODBYE /10.0.112.14:60242). Maybe there is a bug in the policy agent causing a loss of state after x minutes. After the state loss, the communication to the source port 60242 is no longer known/accepted. With the workaround policy, however, the port range is explicitly allowed.
Environment: