bitnami / charts

Bitnami Helm Charts
https://bitnami.com

pod crashes after kafka scale out from 1 to 3 #30064

Open · MikeNikolayev opened this issue 1 week ago

MikeNikolayev commented 1 week ago

Name and Version

bitnami/kafka 29.3.13

What architecture are you using?

amd64

What steps will reproduce the bug?

1. Install the Kafka Helm chart on a single-node k3s cluster.
2. Add 2 more nodes to k3s.
3. Run helm upgrade --install kafka ... to apply the cluster values and end up with 3 pods instead of 1 (sketched below).

The problem occurs in roughly 50% of cases, but on all OpenStack labs in all geographic locations: Asia, America, Europe.
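
For reference, a rough sketch of that sequence; the chart reference, values file names, and the k3s join URL/token below are placeholders, not the exact ones from our environment:

# 1. Install the chart on the single-node k3s cluster (controller.replicaCount defaults to 1).
helm upgrade --install kafka bitnami/kafka --wait --timeout 10m -f default.values.yaml

# 2. Join two more nodes to the k3s cluster (run on each new node).
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -

# 3. Re-run the upgrade with the cluster values so controller.replicaCount becomes 3.
helm upgrade --install kafka bitnami/kafka --wait --timeout 10m -f default.values.yaml -f cluster.values.yaml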

Are you using any custom parameters or values?

Yes. We use the original Kafka chart as a dependency of our own chart and customize Kafka through our values. In addition, we use the Linkerd service mesh, so each Kafka pod (and every other pod in our cluster) has a Linkerd sidecar. Example:

helm upgrade --install kafka /tmp/kafka-.tgz --wait --timeout 10m -f /tmp/helm-values/kafka/cluster.values.yaml -f /tmp/kafka/linkerd.values.yml

default values of our chart:

kafka:
  metrics:
    kafka:
      enabled: false
  enabled: true
  networkPolicy:
    enabled: true
    egressRules:
      customRules: []
  fullnameOverride: "kafka"
  nameOverride: "kafka"
  image:
    registry: localhost
    repository: bitnami/kafka
  containerSecurityContext:
    capabilities:
      drop:
        - ALL
    seccompProfile:
      type: RuntimeDefault
  persistence:
    size: 1Gi
  controller:
    resources:
      limits:
        cpu: "1"
        memory: 2.5Gi
      requests:
        cpu: 200m
        memory: 100Mi
    replicaCount: 1
    heapOpts: -Xms256m -Xmx2048m
    podAnnotations:
      linkerd.io/inject: "enabled"
      kubectl.kubernetes.io/default-container: "kafka"
      config.linkerd.io/image-pull-policy: "Never"
      config.linkerd.io/proxy-cpu-request: 50m
      config.linkerd.io/proxy-cpu-limit: 500m
      config.linkerd.io/proxy-memory-request: 48Mi
      config.linkerd.io/proxy-memory-limit: 512Mi
      config.linkerd.io/proxy-outbound-connect-timeout: 5000ms
  listeners:
    client:
      containerPort: 9092
      protocol: PLAINTEXT
      name: CLIENT
      sslClientAuth: ""
    controller:
      name: CONTROLLER
      containerPort: 9093
      protocol: PLAINTEXT
      sslClientAuth: ""
    interbroker:
      containerPort: 9094
      protocol: PLAINTEXT
      name: INTERNAL
      sslClientAuth: ""
    external:
      containerPort: 9095
      protocol: PLAINTEXT
      name: EXTERNAL
      sslClientAuth: ""
  extraConfig: |
    num.partitions=10
    num.network.threads=3
    num.io.threads=8
    min.insync.replicas=1
    socket.send.buffer.bytes=102400
    socket.receive.buffer.bytes=102400
    socket.request.max.bytes=104857600
    log.dirs=/bitnami/kafka/data
    num.recovery.threads.per.data.dir=1
    offsets.topic.replication.factor=1
    transaction.state.log.replication.factor=1
    transaction.state.log.min.isr=1
    log.flush.interval.messages=10000
    log.flush.interval.ms=1000
    log.retention.hours=1
    log.roll.hours=3
    log.retention.bytes=250000000
    log.segment.bytes=1073741824
    log.retention.check.interval.ms=300000
    allow.everyone.if.no.acl.found=true
    auto.create.topics.enable=true
    default.replication.factor=1
    max.partition.fetch.bytes=1048576
    max.request.size=1048576
    message.max.bytes=20000000

while the cluster values override some of them:

kafka:
  controller:
    replicaCount: 3
  extraConfig: |
    num.partitions=10
    num.network.threads=3
    num.io.threads=8
    min.insync.replicas=1
    socket.send.buffer.bytes=102400
    socket.receive.buffer.bytes=102400
    socket.request.max.bytes=104857600
    log.dirs=/bitnami/kafka/data
    num.recovery.threads.per.data.dir=1
    offsets.topic.replication.factor=3
    transaction.state.log.replication.factor=3
    transaction.state.log.min.isr=1
    log.flush.interval.messages=10000
    log.flush.interval.ms=1000
    log.retention.hours=1
    log.roll.hours=3
    log.retention.bytes=250000000
    log.segment.bytes=1073741824
    log.retention.check.interval.ms=300000
    allow.everyone.if.no.acl.found=true
    auto.create.topics.enable=true
    default.replication.factor=3
    max.partition.fetch.bytes=1048576
    max.request.size=1048576
    message.max.bytes=20000000

and the Linkerd values:

kafka:
  controller:
    podAnnotations:
      linkerd-version: "stable-2.13.7"
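
To double-check which values the release actually ends up with after these overrides, the merged values can be inspected; a minimal sketch (the release name kafka matches the helm command above):

# Show the computed values for the release, including all user-supplied overrides.
helm get values kafka --all

# Or narrow down to the keys that the cluster values are expected to change.
helm get values kafka --all | grep -E "replicaCount|replication.factor"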

What is the expected behavior?

The expected behavior is to see all 3 Kafka pods running and healthy.

What do you see instead?

The two new Kafka pods, kafka-1 and kafka-2, start successfully, but once the original kafka-0 pod is restarted with the new configuration shown above, it goes into CrashLoopBackOff.

Additional information

log error:

ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
java.lang.IllegalArgumentException: Attempt to truncate to offset 0, which is below the current high watermark 7860
        at kafka.raft.KafkaMetadataLog.truncateTo(KafkaMetadataLog.scala:159)
        at org.apache.kafka.raft.ReplicatedLog.truncateToEndOffset(ReplicatedLog.java:227)
        at kafka.raft.KafkaMetadataLog.truncateToEndOffset(KafkaMetadataLog.scala:41)
        at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1135)
        at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:1609)
        at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:1735)
        at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2310)
        at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:64)
        at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
[2024-10-22 19:26:41,522] INFO [controller-0-ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
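
One way to narrow this down is to look at the KRaft metadata quorum as seen from one of the healthy pods; a diagnostic sketch, where the pod name kafka-1 and container name kafka follow this report and the script path assumes the Bitnami image layout:

# Describe the quorum leader/followers and their metadata log end offsets (diagnostic only).
kubectl exec -it kafka-1 -c kafka -- \
  /opt/bitnami/kafka/bin/kafka-metadata-quorum.sh \
  --bootstrap-server localhost:9092 describe --status

kubectl exec -it kafka-1 -c kafka -- \
  /opt/bitnami/kafka/bin/kafka-metadata-quorum.sh \
  --bootstrap-server localhost:9092 describe --replication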
MikeNikolayev commented 1 week ago

Note: this Kafka instance runs in a k3s cluster alongside other services that create topics in Kafka and consume data. Traffic is minimal, but it exists.

MikeNikolayev commented 1 week ago

This is the answer I got from the paid version of ChatGPT. I wonder what the community would say about it:

Root Cause:

When you scale up the Kafka cluster and simultaneously change the replication factors for internal topics, the existing partitions on kafka-0 (which have a replication factor of 1) become incompatible with the new configuration expecting a replication factor of 3. This mismatch leads to log truncation errors because Kafka's replication protocol cannot reconcile the differences between the old and new log structures.
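
If that is the case, the current replication factor of the internal topics can be checked before and after the change; a sketch under the same naming assumptions as above (pod kafka-1, container kafka, Bitnami script path):

# __consumer_offsets is governed by offsets.topic.replication.factor,
# __transaction_state by transaction.state.log.replication.factor.
kubectl exec -it kafka-1 -c kafka -- \
  /opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic __consumer_offsets

kubectl exec -it kafka-1 -c kafka -- \
  /opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic __transaction_state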

Solution:

1. Avoid changing internal topic replication factors during scaling; keep the internal replication factors unchanged.
2. Scale the cluster first (from 1 to 3 pods).
3. Manually increase the replication factors for the internal topics using kafka-reassign-partitions.sh (see the sketch below).
4. Update the configuration if necessary: modify the Helm chart values (offsets.topic.replication.factor=3, transaction.state.log.replication.factor=3) and run helm upgrade.
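
A hedged sketch of step 3 for __consumer_offsets, under the same naming assumptions as above; the JSON file name and broker ids are illustrative, and a complete plan has to list every partition of the topic:

# Reassignment plan asking for 3 replicas per partition (only 2 partitions shown here).
cat > /tmp/increase-rf.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "__consumer_offsets", "partition": 0, "replicas": [0, 1, 2] },
    { "topic": "__consumer_offsets", "partition": 1, "replicas": [1, 2, 0] }
  ]
}
EOF

# Copy the plan into a broker pod, execute it, then verify completion.
kubectl cp /tmp/increase-rf.json kafka-1:/tmp/increase-rf.json -c kafka
kubectl exec -it kafka-1 -c kafka -- \
  /opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file /tmp/increase-rf.json --execute
kubectl exec -it kafka-1 -c kafka -- \
  /opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file /tmp/increase-rf.json --verify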

carrodher commented 1 week ago

Hi, the issue may not be directly related to the Bitnami container image or Helm chart, but rather to how the application is being used or configured in your specific environment, or to a particular scenario that is not easy to reproduce on our side.

If you think that's not the case and want to contribute a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

If you have any questions about the application, customizing its content, or technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.