GoogleCloudPlatform / click-to-deploy

Source for Google Click to Deploy solutions listed on Google Cloud Marketplace.
Apache License 2.0

Mismatched datadir volumeMount causing zookeeper dataloss after pod recreation #2812

Open Kiritow opened 3 days ago

Kiritow commented 3 days ago

Category: Kubernetes apps

Type:

We are using the GKE click-to-deploy feature to deploy Kafka in our cluster. After the underlying nodes were replaced, Kafka failed to start and threw the following error:

ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$)
java.lang.RuntimeException: Invalid cluster.id in: /kafka/logs/meta.properties. Expected <redacted>, but read <redacted>

Our investigation points to the ZooKeeper configuration in the chart. The datadir volumeMount does not match the value of the ZK_DATA_DIR environment variable, so ZooKeeper appears to write its data outside the mounted volume and lose it when the pod is recreated. With the ZooKeeper state gone, the cluster ID it holds would no longer match the one stored in Kafka's meta.properties, which would explain the error above.

In k8s/kafka/chart/kafka/templates/zk-statefulset.yaml, ZK_DATA_DIR is set to /data, but volumeMounts is configured as:

volumeMounts:
- name: config
  mountPath: /config-scripts
- name: datadir
  mountPath: /opt/zookeeper

When we checked the /opt/zookeeper folder inside the pod, it was empty.
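
One possible fix, sketched here as a suggestion rather than a tested patch, is to mount the datadir volume at the path that ZK_DATA_DIR points to. The fragment below assumes ZK_DATA_DIR stays at /data and is declared as a plain env entry. The actual template may set it differently (for example through a templated value), and the surrounding container fields are omitted.

```yaml
# Sketch only: fragment of the ZooKeeper container spec in zk-statefulset.yaml.
# Assumption: ZK_DATA_DIR is a literal env entry; the real chart may template it.
env:
- name: ZK_DATA_DIR
  value: /data
volumeMounts:
- name: config
  mountPath: /config-scripts
- name: datadir
  mountPath: /data   # was /opt/zookeeper; now matches ZK_DATA_DIR
```

The alternative would be to keep the mount at /opt/zookeeper and point ZK_DATA_DIR there instead. Either way, existing deployments would not recover data that was already written outside the persistent volume.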