banzaicloud / koperator

Oh no! Yet another Apache Kafka operator for Kubernetes
Apache License 2.0

External NodePort services are deleted and recreated every few seconds. #922

daisywang-ca closed this issue 1 year ago

daisywang-ca commented 1 year ago

Describe the bug: We are using koperator 0.22.0. After installing a Kafka cluster with two brokers, the kafka-cluster is stuck in the ClusterReconciling state:

NAME            CLUSTER STATE        CLUSTER ALERT COUNT   LAST SUCCESSFUL UPGRADE   UPGRADE ERROR COUNT   AGE
kafka-cluster   ClusterReconciling   0                                               0                     3d19h

The two NodePort services are also deleted and recreated every few seconds, as their AGE of 8s shows:

NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                 AGE
kafka-cluster-0-external          NodePort    10.233.5.235    <none>        9094:30091/TCP                          8s
kafka-cluster-1-external          NodePort    10.233.20.3     <none>        9094:30092/TCP                          8s
kafka-cluster-cruisecontrol-svc   ClusterIP   10.233.35.58    <none>        8090/TCP,9020/TCP                       3d19h
kafka-cluster-headless            ClusterIP   None            <none>        29092/TCP,29093/TCP,9094/TCP,9020/TCP   3d19h
kafka-operator-alertmanager       ClusterIP   10.233.11.110   <none>        9001/TCP                                3d19h
kafka-operator-authproxy          ClusterIP   10.233.63.33    <none>        8443/TCP                                3d19h
kafka-operator-operator           ClusterIP   10.233.10.195   <none>        443/TCP                                 3d19h

As a result, clients are constantly disconnected from the brokers.

Steps to reproduce the issue: Install the kafka-operator and a kafka-cluster with version 0.22.0.

Additional context: We suspect it's caused by https://github.com/banzaicloud/koperator/commit/28a116898cb89264a761a2bfe1d82c40a9528344: the two services are deleted and recreated on every reconcile run.

panyuenlau commented 1 year ago

@daisywang-ca Thanks for reporting the issue. We will look into it.

panyuenlau commented 1 year ago

@daisywang-ca I think your suspicion was correct: this bug is caused by the deleteNonHeadlessServices function that was introduced by the commit you linked.
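To illustrate the general failure mode (this is not koperator's actual code), here is a simplified model of how a cleanup step in a reconcile loop could produce the symptom. The `Service` type and both functions are hypothetical. The sketch assumes the cleanup selects every non-headless service instead of only those that are no longer wanted. The thread does not confirm the exact cause inside deleteNonHeadlessServices.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a hypothetical, simplified stand-in for a Kubernetes Service.
type Service struct {
	Name     string
	Headless bool
}

// deleteAllNonHeadless models an over-eager cleanup (assumption): it selects
// every non-headless service for deletion, whether or not it is still wanted.
// Run on every reconcile, this forces the services to be recreated each time.
func deleteAllNonHeadless(existing []Service) []string {
	var deleted []string
	for _, s := range existing {
		if !s.Headless {
			deleted = append(deleted, s.Name)
		}
	}
	sort.Strings(deleted)
	return deleted
}

// deleteStaleNonHeadless models a safer cleanup: it selects only non-headless
// services that are absent from the desired set.
func deleteStaleNonHeadless(existing []Service, desired map[string]bool) []string {
	var deleted []string
	for _, s := range existing {
		if !s.Headless && !desired[s.Name] {
			deleted = append(deleted, s.Name)
		}
	}
	sort.Strings(deleted)
	return deleted
}

func main() {
	existing := []Service{
		{Name: "kafka-cluster-headless", Headless: true},
		{Name: "kafka-cluster-0-external"},
		{Name: "kafka-cluster-1-external"},
	}
	// Both external services are still wanted by the cluster spec.
	desired := map[string]bool{
		"kafka-cluster-0-external": true,
		"kafka-cluster-1-external": true,
	}
	fmt.Println(deleteAllNonHeadless(existing))            // both external services
	fmt.Println(deleteStaleNonHeadless(existing, desired)) // nothing to delete
}
```

In this model, the first variant tears down both external services on every pass even though nothing changed, which matches the short service ages reported above. The second variant leaves them alone.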

fquinino commented 1 year ago

I'm encountering the same issue here. With my previous installation using version v0.21.2, NodePorts worked properly. After upgrading, the NodePort services for each broker are deleted and recreated every few seconds, causing instability in the cluster. I tried a fresh installation and still faced the same issue.

panyuenlau commented 1 year ago

Thanks for confirming the issue, @fquinino. We will try to fix the issue ASAP and ship a patch release.

panyuenlau commented 1 year ago

BTW, @daisywang-ca @fquinino have you had a chance to join our Slack channel, where we can better communicate about issues like this (and have some fun)?

panyuenlau commented 1 year ago

@daisywang-ca @fquinino v0.23.1 includes the fix for this bug. Please upgrade your operator version accordingly.