MikeNikolayev opened 1 week ago
Note: this Kafka runs in a k3s cluster with other services that create topics in Kafka and consume data. Traffic is minimal, but it exists.
This is the comment I got from the paid version of ChatGPT; I wonder what the community would say about it:
Root Cause:
When you scale up the Kafka cluster and simultaneously change the replication factors for internal topics, the existing partitions on kafka-0 (which have a replication factor of 1) become incompatible with the new configuration expecting a replication factor of 3. This mismatch leads to log truncation errors because Kafka's replication protocol cannot reconcile the differences between the old and new log structures.
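Before changing anything, the mismatch can be confirmed by describing the internal topics and checking their current replication factor. A minimal sketch, assuming the Kafka CLI tools are reachable from a client pod and a bootstrap address of kafka:9092 (both placeholders):

```
# Inspect ReplicationFactor and Replicas of the consumer offsets topic
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic __consumer_offsets

# Same check for the transaction state log
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic __transaction_state
```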
Solution:
1. Avoid changing internal topic replication factors during scaling; keep the internal replication factors unchanged.
2. Scale the cluster first (from 1 to 3 pods).
3. Manually increase the replication factors for the internal topics using kafka-reassign-partitions.sh (a sketch follows below).
4. Update the configuration if necessary: modify the Helm chart values (offsets.topic.replication.factor=3, transaction.state.log.replication.factor=3) and run helm upgrade.
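For step 3, a hedged sketch of what the manual reassignment could look like. The broker IDs (0, 1, 2), the bootstrap address, and the two-partition example are placeholders; __consumer_offsets has 50 partitions by default, so the real JSON would list all of them:

```
# reassign.json: place each partition of __consumer_offsets on three replicas
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "__consumer_offsets", "partition": 0, "replicas": [0, 1, 2] },
    { "topic": "__consumer_offsets", "partition": 1, "replicas": [1, 2, 0] }
  ]
}
EOF

# Run the reassignment, then verify it completed
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassign.json --execute
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassign.json --verify
```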
Hi, the issue may not be directly related to the Bitnami container image or Helm chart, but rather to how the application is being used or configured in your specific environment, or to a particular scenario that is not easy to reproduce on our side.
If you think that's not the case and want to contribute a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.
Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.
If you have any questions about the application, customizing its content, or technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.
With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.
Name and Version
bitnami/kafka 29.3.13
What architecture are you using?
amd64
What steps will reproduce the bug?
1. Install the Kafka Helm chart on a 1-node k3s cluster.
2. Add 2 more nodes to k3s.
3. Run
helm upgrade --install kafka ...
to apply the cluster values and go from 1 pod to 3.

The problem occurs in roughly 50% of cases, but on all OpenStack labs in all geo locations: Asia, America, Europe.

Are you using any custom parameters or values?
Yes. We use the original Kafka chart as a dependency of our own chart, with our own version, and customize Kafka. In addition, we use the Linkerd service mesh, so each Kafka pod (and the other pods in our cluster) has a Linkerd sidecar. Example:
default values of our chart:
and linkerd:
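For illustration, a minimal hypothetical sketch of how such a wrapper chart might declare bitnami/kafka as a dependency and wire in the Linkerd sidecar; the chart name, value keys, and layout below are assumptions and should be checked against the chart's own values.yaml for version 29.3.13:

```
# Chart.yaml of the wrapper chart (hypothetical name and version)
apiVersion: v2
name: our-platform
version: 0.1.0
dependencies:
  - name: kafka
    version: 29.3.13
    repository: https://charts.bitnami.com/bitnami
---
# values.yaml of the wrapper chart -- everything under "kafka:" is passed to the
# bitnami/kafka subchart; the keys below are assumptions, not verified chart values
kafka:
  controller:
    replicaCount: 3                  # assumed key for scaling from 1 to 3 pods
    podAnnotations:
      linkerd.io/inject: enabled     # standard Linkerd proxy injection annotation
  # assumed way to append internal-topic settings to server.properties
  extraConfig: |
    offsets.topic.replication.factor=3
    transaction.state.log.replication.factor=3
```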
What is the expected behavior?
The expected behavior is to see all 3 Kafka pods running and healthy.
What do you see instead?
The 2 new Kafka pods, kafka-1 and kafka-2, run successfully, but once the original kafka-0 pod is restarted with the new configuration mentioned above, it goes into CrashLoopBackOff.
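For reference, the crash details can be collected from the restarting pod. A sketch assuming the pod name from above and that the main container is named kafka (an assumption; with the Linkerd sidecar present, the container must be selected explicitly):

```
# Restart reason and recent events for the crashing pod
kubectl describe pod kafka-0

# Logs of the previously crashed Kafka container
kubectl logs kafka-0 -c kafka --previous
```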
Additional information
log error: