
[bitnami/kafka] Kafka pods randomly get killed, reason unclear #27870

Closed: florian-besser closed this issue 1 week ago

florian-besser commented 2 weeks ago

Name and Version

oci://registry-1.docker.io/bitnamicharts/kafka:29.3.4

What architecture are you using?

arm64

What steps will reproduce the bug?

Note that this seems to be a random issue; we have deployed the chart as described below in 3 environments, and only one of them shows the erroneous behavior.

Are you using any custom parameters or values?

heapOpts: "-Xmx1536m -Xms1536m"
controller:
  heapOpts: "-Xmx1536m -Xms1536m"
  resources:
    requests:
      cpu: 750m
      memory: 2Gi
    limits:
      cpu: 750m
      memory: 2Gi
  podAnnotations:
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
extraConfigYaml:
  "authorizer.class.name": "org.apache.kafka.metadata.authorizer.StandardAuthorizer"
  "super.users": "User:controller_user;User:inter_broker_user"
listeners:
  client:
    containerPort: 9092
    protocol: SASL_SSL
    name: CLIENT
  controller:
    name: CONTROLLER
    containerPort: 9093
    protocol: SASL_SSL
    sslClientAuth: "required"
sasl:
  enabledMechanisms: PLAIN #,SCRAM-SHA-256,SCRAM-SHA-512
  client:
    users: <censored>
    passwords: <censored>
tls:
  type: PEM
  pemChainIncluded: false
  existingSecret: <censored>
  keystorePassword: <censored>
  truststorePassword: <censored>
metrics:
  jmx:
    enabled: true
    service:
      annotations:
        "prometheus.io/path": "/metrics"
        "prometheus.io/port": "{{ .Values.metrics.jmx.service.ports.metrics }}"
        "prometheus.io/scrape": "true"

What is the expected behavior?

The pods should consistently be healthy / should not die.

What do you see instead?

The pods are terminated at seemingly random times.

I checked /dev/termination-log for the root cause of the termination:

command terminated with exit code 137

At the same time I see the following in the logs:

[2024-07-10 04:38:46,918] INFO [GroupCoordinator 1]: Dynamic member with unknown member id joins group gsg.backend-consumers-Reporting-server in Stable state. Created a new member id gsg.consumer-server-40ac6f5b-3e4e-4038-811f-9b47a58839f8 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-07-10 04:38:46,920] INFO [GroupCoordinator 1]: Preparing to rebalance group gsg.backend-consumers-Reporting-server in state PreparingRebalance with old generation 96 (__consumer_offsets-46) (reason: Adding new member gsg.consumer-server-40ac6f5b-3e4e-4038-811f-9b47a58839f8 with group instance id None; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-07-10 04:38:54,149] INFO [GroupCoordinator 1]: Member gsg.consumer-server-7e09f6b3-0b16-4ca6-ba79-a122c34533c2 in group gsg.backend-consumers-Reporting-server has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2024-07-10 04:38:54,149] INFO [GroupCoordinator 1]: Stabilized group gsg.backend-consumers-Reporting-server generation 97 (__consumer_offsets-46) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2024-07-10 04:38:54,162] INFO [GroupCoordinator 1]: Assignment received from leader gsg.consumer-server-40ac6f5b-3e4e-4038-811f-9b47a58839f8 for group gsg.backend-consumers-Reporting-server for generation 97. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-07-10 05:05:55,781] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2024-07-10 05:05:55,786] INFO [BrokerServer id=1] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
[2024-07-10 05:05:55,786] INFO [BrokerServer id=1] shutting down (kafka.server.BrokerServer)
[2024-07-10 05:05:55,789] INFO [BrokerLifecycleManager id=1] Beginning controlled shutdown. (kafka.server.BrokerLifecycleManager)
[2024-07-10 05:05:55,790] INFO [QuorumController id=1] Unfenced broker 1 has requested and been granted a controlled shutdown. (org.apache.kafka.controller.BrokerHeartbeatManager)
[2024-07-10 05:05:55,794] INFO [QuorumController id=1] enterControlledShutdown[1]: changing 64 partition(s) (org.apache.kafka.controller.ReplicationControlManager)
[2024-07-10 05:05:55,794] INFO [QuorumController id=1] Replayed BrokerRegistrationChangeRecord modifying the registration for broker 1: BrokerRegistrationChangeRecord(brokerId=1, brokerEpoch=1221446, fenced=0, inControlledShutdown=1, logDirs=[]) (org.apache.kafka.controller.ClusterControlManager)
[2024-07-10 05:05:55,826] INFO [Broker id=1] Transitioning 64 partition(s) to local followers. (state.change.logger)
[2024-07-10 05:05:55,828] INFO [Broker id=1] Follower fintonia.Backend.AccountUserDto-0 starts at leader epoch 35 from offset 0 with partition epoch 35 and high watermark 0. Current leader is -1. Previous leader Some(-1) and previous leader epoch was 35. (state.change.logger)
[SNIP]
[2024-07-10 05:05:58,844] INFO [SocketServer listenerType=CONTROLLER, nodeId=1] Stopping socket server request processors (kafka.network.SocketServer)
[2024-07-10 05:05:58,855] INFO [SocketServer listenerType=CONTROLLER, nodeId=1] Stopped socket server request processors (kafka.network.SocketServer)
[2024-07-10 05:05:58,856] INFO [QuorumController id=1] QuorumController#beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,856] INFO [SocketServer listenerType=CONTROLLER, nodeId=1] Shutting down socket server (kafka.network.SocketServer)
[2024-07-10 05:05:58,869] INFO [SocketServer listenerType=CONTROLLER, nodeId=1] Shutdown completed (kafka.network.SocketServer)
[2024-07-10 05:05:58,871] INFO [data-plane Kafka Request Handler on Broker 1], shutting down (kafka.server.KafkaRequestHandlerPool)
[2024-07-10 05:05:58,878] INFO [data-plane Kafka Request Handler on Broker 1], shut down completely (kafka.server.KafkaRequestHandlerPool)
[2024-07-10 05:05:58,878] INFO [ExpirationReaper-1-AlterAcls]: Shutting down (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2024-07-10 05:05:58,905] INFO [ExpirationReaper-1-AlterAcls]: Stopped (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2024-07-10 05:05:58,905] INFO [ExpirationReaper-1-AlterAcls]: Shutdown completed (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2024-07-10 05:05:58,906] INFO [controller-1-ThrottledChannelReaper-Fetch]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,906] INFO [controller-1-ThrottledChannelReaper-Fetch]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,907] INFO [controller-1-ThrottledChannelReaper-Fetch]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,907] INFO [controller-1-ThrottledChannelReaper-Produce]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,907] INFO [controller-1-ThrottledChannelReaper-Produce]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,907] INFO [controller-1-ThrottledChannelReaper-Produce]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,908] INFO [controller-1-ThrottledChannelReaper-Request]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,908] INFO [controller-1-ThrottledChannelReaper-Request]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,909] INFO [controller-1-ThrottledChannelReaper-Request]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,909] INFO [controller-1-ThrottledChannelReaper-ControllerMutation]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,909] INFO [controller-1-ThrottledChannelReaper-ControllerMutation]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,909] INFO [controller-1-ThrottledChannelReaper-ControllerMutation]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-07-10 05:05:58,910] INFO [QuorumController id=1] closed event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,918] INFO [SharedServer id=1] Stopping SharedServer (kafka.server.SharedServer)
[2024-07-10 05:05:58,918] INFO [MetadataLoader id=1] beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,918] INFO [SnapshotGenerator id=1] close: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,918] INFO [SnapshotGenerator id=1] closed event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,919] INFO [MetadataLoader id=1] closed event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,920] INFO [SnapshotGenerator id=1] closed event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-07-10 05:05:58,933] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2024-07-10 05:05:58,933] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2024-07-10 05:05:58,933] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2024-07-10 05:05:58,933] INFO App info kafka.server for 1 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2024-07-10 05:05:58,934] INFO App info kafka.server for 1 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[No further logs present, I assume the container self-terminated here]

This is a bit weird to me: exit code 137 corresponds to SIGKILL (128 + 9), which should not allow for a graceful shutdown, yet the log implies a SIGTERM was sent instead, enabling a graceful shutdown. What I'm still trying to figure out is who sent that signal, and why.
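
(Note that the kubelet first sends SIGTERM and only follows up with SIGKILL once the termination grace period expires, which could explain seeing both a graceful shutdown in the logs and exit code 137 if the shutdown did not finish in time.) One way to see what the kubelet recorded for the previous container instance (exit code and reason, e.g. OOMKilled) is something like the following, with the pod and container names taken from the describe output further below:

kubectl get pod -n kafka kafka-controller-1 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="kafka")].lastState.terminated}'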

I thought about a potential OOMKilled issue, but the Prometheus graphs show the JVM memory usage was well below the maximum:

[screenshot: Prometheus graph showing JVM memory usage well below the configured maximum]

Also, OOMKilled means a SIGKILL, and thus should not result in a graceful shutdown either.

I did a quick kubectl describe pod -n kafka kafka-controller-1 to check whether there's any information there:

Name:             kafka-controller-1
Namespace:        kafka
Priority:         0
Service Account:  kafka
Node:             ip-10-0-3-176.ap-southeast-1.compute.internal/10.0.3.176
Start Time:       Wed, 10 Jul 2024 13:07:44 +0800
Labels:           app.kubernetes.io/component=controller-eligible
                  app.kubernetes.io/instance=kafka
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=kafka
                  app.kubernetes.io/part-of=kafka
                  app.kubernetes.io/version=3.7.0
                  apps.kubernetes.io/pod-index=1
                  controller-revision-hash=kafka-controller-54b66c4dc9
                  helm.sh/chart=kafka-29.3.4
                  statefulset.kubernetes.io/pod-name=kafka-controller-1
Annotations:      checksum/configuration: dd335689773274079c8603bfc0f8b37f961f55f6437e12054f8fadd9f6f3d3a1
                  checksum/jmx-configuration: 5f3d12b6f61db55ab411b5b7403f1a88bd6f04dfa913d9ab97a717566396c2b1
                  checksum/passwords-secret: a8824f6a328dd7687299bb413ced02a938a489ef1ebefc0841c0394fa8ee5797
                  cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.0.3.225
IPs:
  IP:           10.0.3.225
Controlled By:  StatefulSet/kafka-controller
Init Containers:
  kafka-init:
    Container ID:  containerd://622cd096b94b93f55d74dd8d907cb74fd48ed06e098210427107c7a4e1052819
    Image:         docker.io/bitnami/kafka:3.7.0-debian-12-r7
    Image ID:      docker.io/bitnami/kafka@sha256:9abe5ce1ce14e0ff3041075c5183202fb5c47ea28ad270bfcb067f7ade304d8b
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -ec
      /scripts/kafka-init.sh

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 10 Jul 2024 13:08:16 +0800
      Finished:     Wed, 10 Jul 2024 13:08:17 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      BITNAMI_DEBUG:                  false
      MY_POD_NAME:                    kafka-controller-1 (v1:metadata.name)
      KAFKA_VOLUME_DIR:               /bitnami/kafka
      KAFKA_MIN_ID:                   0
      KAFKA_CLIENT_USERS:             provisioning,redpanda,smithy,arkhon,gsg
      KAFKA_CLIENT_PASSWORDS:         <set to the key 'client-passwords' in secret 'kafka-user-passwords'>  Optional: false
      KAFKA_INTER_BROKER_USER:        inter_broker_user
      KAFKA_INTER_BROKER_PASSWORD:    <set to the key 'inter-broker-password' in secret 'kafka-user-passwords'>  Optional: false
      KAFKA_CONTROLLER_USER:          controller_user
      KAFKA_CONTROLLER_PASSWORD:      <set to the key 'controller-password' in secret 'kafka-user-passwords'>  Optional: false
      KAFKA_TLS_TYPE:                 PEM
      KAFKA_TLS_KEYSTORE_PASSWORD:    <set to the key 'keystore-password' in secret 'kafka-tls-passwords'>    Optional: false
      KAFKA_TLS_TRUSTSTORE_PASSWORD:  <set to the key 'truststore-password' in secret 'kafka-tls-passwords'>  Optional: false
    Mounts:
      /bitnami/kafka from data (rw)
      /certs from kafka-shared-certs (rw)
      /config from kafka-config (rw)
      /configmaps from kafka-configmaps (rw)
      /mounted-certs from kafka-certs (ro)
      /scripts from scripts (rw)
      /secret-config from kafka-secret-config (rw)
      /tmp from tmp (rw)
Containers:
  kafka:
    Container ID:   containerd://e8daee2842a12a6856af6ffc4cc5ed9181e1ac8967cb14d992a4e2241ac2ba7f
    Image:          docker.io/bitnami/kafka:3.7.0-debian-12-r7
    Image ID:       docker.io/bitnami/kafka@sha256:9abe5ce1ce14e0ff3041075c5183202fb5c47ea28ad270bfcb067f7ade304d8b
    Ports:          9093/TCP, 9092/TCP, 9094/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Wed, 10 Jul 2024 13:08:18 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                750m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                750m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Liveness:             exec [pgrep -f kafka] delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:            tcp-socket :controller delay=5s timeout=5s period=10s #success=1 #failure=6
    Environment:
      BITNAMI_DEBUG:           false
      KAFKA_HEAP_OPTS:         -Xmx1536m -Xms1536m
      KAFKA_KRAFT_CLUSTER_ID:  <set to the key 'kraft-cluster-id' in secret 'kafka-kraft-cluster-id'>  Optional: false
      JMX_PORT:                5555
    Mounts:
      /bitnami/kafka from data (rw)
      /opt/bitnami/kafka/config/certs from kafka-shared-certs (ro)
      /opt/bitnami/kafka/config/server.properties from kafka-config (rw,path="server.properties")
      /opt/bitnami/kafka/logs from logs (rw)
      /tmp from tmp (rw)
  jmx-exporter:
    Container ID:  containerd://920fc766e82ca017c5b2e6333ec3153388d4137f046413f640a0cf47c52300b0
    Image:         docker.io/bitnami/jmx-exporter:1.0.1-debian-12-r1
    Image ID:      docker.io/bitnami/jmx-exporter@sha256:b1069336484005bb737eaa0eee466195c8f2de125624941a2193eaeac3bc1c91
    Port:          5556/TCP
    Host Port:     0/TCP
    Command:
      java
    Args:
      -XX:MaxRAMPercentage=100
      -XshowSettings:vm
      -jar
      jmx_prometheus_httpserver.jar
      5556
      /etc/jmx-kafka/jmx-kafka-prometheus.yml
    State:          Running
      Started:      Wed, 10 Jul 2024 13:08:34 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                375m
      ephemeral-storage:  1Gi
      memory:             384Mi
    Requests:
      cpu:                250m
      ephemeral-storage:  50Mi
      memory:             256Mi
    Environment:          <none>
    Mounts:
      /etc/jmx-kafka from jmx-config (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-kafka-controller-1
    ReadOnly:   false
  kafka-configmaps:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kafka-controller-configuration
    Optional:  false
  kafka-secret-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kafka-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kafka-scripts
    Optional:  false
  jmx-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kafka-jmx-configuration
    Optional:  false
  kafka-shared-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kafka-certs:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          kafka-certs
    SecretOptionalName:  <nil>
  logs:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

Note that I'm not seeing any reason here why the previous pod was killed; the pod that has since replaced it is running fine.

I also checked whether the pods were terminated due to a cluster scale-down, but the issue persisted even after adding the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation. The autoscaler logs also show that no scale-down took place at the time.

Additional information

At the moment I can say that the pods are being killed, but I'm unable to ascertain exactly why. Still, as this results in pods being killed and restarted, I'm categorizing it as an issue for now - any help on what other pieces of information I should provide is much appreciated.

Note that, as far as I can see, this issue only happens with Kafka; no other app in the K8s cluster shows this behavior.

javsalgar commented 2 weeks ago

In order to rule out a potential OOMKilled, I would disable the resource limits/requests when deploying in that platform; that should give us some clues as to whether the issue is related to that.
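
For example, something along these lines in your values should do it (please double-check against the values.yaml of your chart version; the resourcesPreset line assumes this chart version applies a preset when resources is empty):

controller:
  # rest of the controller values unchanged
  # temporarily run without requests/limits so the kubelet cannot OOM-kill the
  # container on a cgroup memory limit while debugging
  resources: {}
  # "none" assumes this chart version supports resourcesPreset
  resourcesPreset: "none"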

florian-besser commented 2 weeks ago

Disabled the limits, now waiting for the issue to reappear. Note that it can take hours or days between occurrences, so no immediate feedback, sadly :/

florian-besser commented 2 weeks ago

The issue reoccurred; this time I was able to capture some events with kubectl get events -n kafka --sort-by='.metadata.creationTimestamp':

30m         Normal    Sync                     ingress/ingress-kafka          Scheduled for sync
29m         Normal    Killing                  pod/kafka-controller-0         Stopping container jmx-exporter
29m         Normal    Killing                  pod/kafka-controller-0         Stopping container kafka
29m         Warning   FailedToUpdateEndpoint   endpoints/kafka                Failed to update endpoint kafka/kafka: Operation cannot be fulfilled on endpoints "kafka": the object has been modified; please apply your changes to the latest version and try again
29m         Warning   RecreatingFailedPod      statefulset/kafka-controller   StatefulSet kafka/kafka-controller is recreating failed Pod kafka-controller-0

Note that the nodes still have a memory limit (set by AWS), so if you have any insights into where I could find more information on e.g. the K8s cluster killing pods, I'd be keen to look there as well. Our nodes each have 8 GB of RAM, so they should be able to accommodate the Kafka pods with the current settings just fine (all Kafka pods are deployed on separate nodes, of course), but there are other pods on these nodes, so it is theoretically possible that a node runs out of memory.

I say theoretically:

kubectl top nodes
NAME                                            CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%     
ip-10-0-1-152.ap-southeast-1.compute.internal   102m         5%          1525Mi          21%         
ip-10-0-1-92.ap-southeast-1.compute.internal    298m         14%         2894Mi          40%         
ip-10-0-2-21.ap-southeast-1.compute.internal    203m         10%         3957Mi          55%         
ip-10-0-2-83.ap-southeast-1.compute.internal    461m         23%         2766Mi          38%         
ip-10-0-3-164.ap-southeast-1.compute.internal   295m         14%         2584Mi          36%         
ip-10-0-3-30.ap-southeast-1.compute.internal    314m         15%         4178Mi          58%         
ip-10-0-3-51.ap-southeast-1.compute.internal    207m         10%         1949Mi          27%         

Our cluster automatically scales out on memory pressure, so this should not happen.

Unfortunately I'm still unsure where I should look to find more information as to why the pods failed.

javsalgar commented 2 weeks ago

Hi,

In this sense, I would double-check with the infrastructure provider to see if they can help debug this kind of situation or confirm whether this is a memory issue. If the pods end up in the Killing state, it seems to me that at some point a memory peak probably forces the eviction.
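
In the meantime, something like the following might also surface evictions or preemptions at the cluster level (a sketch; the event reasons and the node name may need adjusting for your setup):

# cluster-wide events for kubelet evictions and scheduler preemptions
kubectl get events -A --field-selector reason=Evicted
kubectl get events -A --field-selector reason=Preempted

# conditions (e.g. MemoryPressure) on the node that hosted the killed pod
kubectl describe node ip-10-0-3-176.ap-southeast-1.compute.internal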

florian-besser commented 1 week ago

It looks like we suffered from pod preemptions. We will continue to monitor, but this seems to be unrelated to Kafka as such: the Kafka pods were in our case the biggest pods around, which made them a prime target for preemption, and on top of that we had not set a priority for them.
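
For reference, one way to mitigate this would be to give the Kafka pods a PriorityClass; a sketch (the kafka-high-priority name is made up, and this assumes the chart exposes controller.priorityClassName, so check the values.yaml of your chart version):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: kafka-high-priority
value: 1000000          # higher than the default of 0, so Kafka pods are no longer the cheapest preemption victims
globalDefault: false
description: "Priority for Kafka pods so they are not preempted first"

and in the chart values:

controller:
  # assumes the chart exposes controller.priorityClassName
  priorityClassName: kafka-high-priority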

javsalgar commented 1 week ago

Thanks for letting us know!