banzaicloud / koperator

Oh no! Yet another Apache Kafka operator for Kubernetes
Apache License 2.0

Underreplicated Brokers #916

Open PawelKalamba opened 1 year ago

PawelKalamba commented 1 year ago

Describe the bug: After we deleted the unused Envoy, the connectors broke, and the restarted brokers stopped replicating to the expected number of replicas.

Steps to reproduce the issue:

  1. Create a KafkaCluster with an external Envoy and 5 brokers.
  2. Deploy the next version, changing the Envoy external listener to an internal load balancer via annotations.
  3. Delete the Service, Deployment, and ConfigMap of the unused Envoy.
  4. When the connector breaks, manually restart the brokers one after another.

Expected behavior: The Kafka operator should recreate all brokers and elect a leader, and messages should flow normally.

Additional context: We have a cluster of 5 brokers (configuration: CustomKafkaCluster.txt), but only 2 out of 5 were started by the operator. The Kubernetes describe output is in DescribeKafka0.txt. We now have only kafka-0 and kafka-1, plus 5 services, of which only two have updated ports for Envoy. We tried deploying the last working version, but with no luck. The same happened after scaling down kafka-operator and zookeeper-operator, deleting the ZooKeeper and broker pods, and then scaling the operators back up. We also noticed that our KafkaCluster is stuck in a never-ending ClusterRollingUpgrading state.

Also, I'm attaching some logs from related pods. LogsFromKafkaOperator.txt LogsFromZKOperator.txt LogsFromInternalEnvoy.txt

We also noticed that some of our partitions don't have an elected leader:

Topic: TopicName TopicId: topicID PartitionCount: 12 ReplicationFactor: 2 Configs: cleanup.policy=compact,delete,retention.ms=604800000,unclean.leader.election.enable=true
    Topic: TopicName Partition: 0 Leader: 0 Replicas: 3,0 Isr: 0
    Topic: TopicName Partition: 1 Leader: 0 Replicas: 0,4 Isr: 0
    Topic: TopicName Partition: 2 Leader: 1 Replicas: 4,1 Isr: 1
    Topic: TopicName Partition: 3 Leader: 1 Replicas: 1,2 Isr: 1
    Topic: TopicName Partition: 4 Leader: none Replicas: 2,3 Isr: 3
    Topic: TopicName Partition: 5 Leader: none Replicas: 3,4 Isr: 4
    Topic: TopicName Partition: 6 Leader: 0 Replicas: 0,1 Isr: 1,0
    Topic: TopicName Partition: 7 Leader: none Replicas: 4,2 Isr: 4
    Topic: TopicName Partition: 8 Leader: 1 Replicas: 1,3 Isr: 1
    Topic: TopicName Partition: 9 Leader: 0 Replicas: 2,0 Isr: 0
    Topic: TopicName Partition: 10 Leader: 1 Replicas: 3,1 Isr: 1
    Topic: TopicName Partition: 11 Leader: 0 Replicas: 0,2 Isr: 0
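For anyone who wants to triage output like this quickly, here is a minimal Python sketch that picks out partitions with no leader and partitions whose ISR is smaller than their replica list. It is not part of Koperator or Kafka; the function names (`parse_partitions`, `summarize`) are made up for illustration, and it assumes the describe output keeps the `Partition: / Leader: / Replicas: / Isr:` field order shown above. The sample text contains only three of the twelve partition lines from the report.

```python
import re

# Three partition lines copied from the describe output in this report.
DESCRIBE_OUTPUT = """\
Topic: TopicName Partition: 0 Leader: 0 Replicas: 3,0 Isr: 0
Topic: TopicName Partition: 4 Leader: none Replicas: 2,3 Isr: 3
Topic: TopicName Partition: 6 Leader: 0 Replicas: 0,1 Isr: 1,0
"""

# Assumes the field order shown in the report; whitespace may be spaces or tabs.
PARTITION_LINE = re.compile(
    r"Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+"
    r"Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_partitions(text):
    """Return one dict per partition entry found in the describe output."""
    rows = []
    for match in PARTITION_LINE.finditer(text):
        leader = match.group("leader")
        rows.append({
            "partition": int(match.group("partition")),
            # "none" means no leader is currently elected.
            "leader": None if leader == "none" else int(leader),
            "replicas": [int(b) for b in match.group("replicas").split(",")],
            "isr": [int(b) for b in match.group("isr").split(",") if b],
        })
    return rows


def summarize(rows):
    """Return (leaderless partitions, under-replicated partitions)."""
    leaderless = [r["partition"] for r in rows if r["leader"] is None]
    under_replicated = [
        r["partition"] for r in rows if len(r["isr"]) < len(r["replicas"])
    ]
    return leaderless, under_replicated


rows = parse_partitions(DESCRIBE_OUTPUT)
leaderless, under_replicated = summarize(rows)
print("no leader:", leaderless)
print("under-replicated:", under_replicated)
```

Run against the full output above, the same check should flag partitions 4, 5, and 7 as leaderless and every partition except 6 as under-replicated.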

Do you know the logic and order in which the operator starts Kafka brokers? Does it start them one by one and wait for some kind of health check (and if so, what is that health check), or does it start them all at once?

panyuenlau commented 1 year ago

Hi @PawelKalamba,

Do you know the logic and order in which the operator starts Kafka brokers? Does it start them one by one and wait for some kind of health check (and if so, what is that health check), or does it start them all at once?

  1. The operator brings up Kafka brokers one by one: https://github.com/banzaicloud/koperator/blob/08b2a9f2197f1f9528a17966e47da5763a88532d/pkg/resources/kafka/kafka.go#L311-L362 and in a specific order: https://github.com/banzaicloud/koperator/blob/08b2a9f2197f1f9528a17966e47da5763a88532d/pkg/resources/kafka/kafka.go#L1263-L1269
  2. Health check: by default there aren't really any health checks for the broker pods. @bartam1 @balassai @pregnor, please correct me if I am mistaken about this.
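To make the distinction in the question concrete, here is a rough Python sketch of a one-by-one startup loop with an optional readiness gate. It is not Koperator's code (the operator is written in Go, and its real logic is in the files linked above). `start_sequentially`, `start_broker`, and `is_ready` are hypothetical names. With the default `is_ready`, which always returns true, the loop simply starts every broker in order, which corresponds to "one by one, no health check".

```python
from typing import Callable, Iterable, List


def start_sequentially(
    broker_ids: Iterable[int],
    start_broker: Callable[[int], None],
    is_ready: Callable[[int], bool] = lambda broker_id: True,
) -> List[int]:
    """Start brokers in the given order.

    Stops at the first broker that does not report ready and returns the
    IDs of the brokers that were started and became ready before that.
    """
    started = []
    for broker_id in broker_ids:
        start_broker(broker_id)          # hypothetical: create the broker pod
        if not is_ready(broker_id):      # hypothetical readiness gate
            break
        started.append(broker_id)
    return started


launched = []
# No gate: all five brokers are started in order.
print(start_sequentially(range(5), launched.append))
# With a gate that fails for broker 2, the rollout stops after brokers 0 and 1.
print(start_sequentially(range(5), launched.append, is_ready=lambda b: b != 2))
```

The second call only illustrates how a gated sequential rollout can stall partway through; whether anything similar explains the two-broker state reported here has not been established in this thread.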

I did some quick checks in the source code but couldn't find an explanation for the behavior you are seeing.

bartam1 commented 1 year ago

Dear @PawelKalamba, which version of Koperator generated the log LogsFromKafkaOperator.txt? I can see cruise-control:2.5.68, which is a pretty old one. Can you try our latest release? https://github.com/banzaicloud/koperator/releases/tag/v0.22.0

panyuenlau commented 1 year ago

Hey @PawelKalamba 👋, any updates on this would be appreciated. BTW, it would also be great if you joined our Slack channel, where we can communicate better about issues and questions like this, so that other members of the community are also aware of potential issues.