
[bitnami/kafka] Broker authentication error #23777

Closed. robertdumitrescu closed this issue 5 months ago.

robertdumitrescu commented 7 months ago

Name and Version

bitnami/kafka 26.9.0

What architecture are you using?

None

What steps will reproduce the bug?

I am attempting to deploy a secure Kafka cluster in KRaft mode with 3 brokers using the Bitnami Kafka Helm chart on a local Kubernetes cluster managed by Minikube. Despite following various guides and collating configuration snippets from Stack Overflow and GitHub issues, I cannot get the cluster to start correctly. The deployment targets a development environment on Minikube v1.32.0 and Docker Desktop for Windows v4.27.2. This report includes my values.yaml configuration, along with the server.properties, producer.properties, and consumer.properties files from one of the pods, to provide a comprehensive overview of my setup.

Environment:

Steps to Reproduce:

1. Set up Minikube locally using the specified version.
2. Configure the Helm values as per the attached values.yaml.
3. Deploy the Kafka cluster using the Bitnami Kafka Helm chart (see the commands sketched below).
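
A minimal sketch of the deployment commands, assuming the chart is installed from the Bitnami repo as a release named kafka in a processing namespace:

# start the local cluster
minikube start

# create the target namespace
kubectl create namespace processing

# add the Bitnami repo and install the chart version from this report
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install kafka bitnami/kafka --version 26.9.0 \
  --namespace processing -f values.yaml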

Are you using any custom parameters or values?

This is the values file

image:
  debug: true
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
externalAccess:
  enabled: false
  autoDiscovery:
    enabled: true
kraft:
  enabled: true
  replicas: 3
  zookeeper:
    enabled: false
extraEnvVars:
  - name: KAFKA_ENABLE_KRAFT
    value: "true"
  - name: KAFKA_CFG_DELETE_TOPIC_ENABLE
    value: "true"  # Add this line to enable topic deletion
  - name: KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE
    value: "true"  # Add this line to enable topic auto-creation
  - name: KAFKA_CFG_SASL_MECHANISM_INTER_BROKER_PROTOCOL
    value: "PLAIN"

controller:
  persistence:
    enabled: true
    size: 8Gi
broker:
  persistence:
    enabled: true
    size: 8Gi
volumePermissions:
  enabled: true
listeners:
  interbroker:
    protocol: PLAINTEXT
serviceAccount:
  create: true
rbac:
  create: true

What is the expected behavior?

The Kafka cluster should be deployed successfully with 3 brokers running in secure mode and Kraft mode enabled, without any errors.

What do you see instead?

The deployment fails, resulting in authentication errors on all broker pods.

 (kafka.server.KafkaConfig)
[2024-02-21 23:01:21,225] INFO [BrokerServer id=2] Waiting for the broker to be unfenced (kafka.server.BrokerServer)
[2024-02-21 23:01:21,313] INFO [BrokerLifecycleManager id=2] The broker has been unfenced. Transitioning from RECOVERY to RUNNING. (kafka.server.BrokerLifecycleManager)
[2024-02-21 23:01:21,313] INFO [BrokerServer id=2] Finished waiting for the broker to be unfenced (kafka.server.BrokerServer)
[2024-02-21 23:01:21,313] INFO authorizerStart completed for endpoint INTERNAL. Endpoint is now READY. (org.apache.kafka.server.network.EndpointReadyFutures)
[2024-02-21 23:01:21,313] INFO authorizerStart completed for endpoint CLIENT. Endpoint is now READY. (org.apache.kafka.server.network.EndpointReadyFutures)
[2024-02-21 23:01:21,313] INFO [SocketServer listenerType=BROKER, nodeId=2] Enabling request processing. (kafka.network.SocketServer)
[2024-02-21 23:01:21,314] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.DataPlaneAcceptor)
[2024-02-21 23:01:21,314] INFO Awaiting socket connections on 0.0.0.0:9094. (kafka.network.DataPlaneAcceptor)
[2024-02-21 23:01:21,316] INFO [BrokerServer id=2] Waiting for all of the authorizer futures to be completed (kafka.server.BrokerServer)
[2024-02-21 23:01:21,316] INFO [BrokerServer id=2] Finished waiting for all of the authorizer futures to be completed (kafka.server.BrokerServer)
[2024-02-21 23:01:21,316] INFO [BrokerServer id=2] Waiting for all of the SocketServer Acceptors to be started (kafka.server.BrokerServer)
[2024-02-21 23:01:21,316] INFO [BrokerServer id=2] Finished waiting for all of the SocketServer Acceptors to be started (kafka.server.BrokerServer)
[2024-02-21 23:01:21,316] INFO [BrokerServer id=2] Transition from STARTING to STARTED (kafka.server.BrokerServer)
[2024-02-21 23:01:21,316] INFO Kafka version: 3.6.1 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-21 23:01:21,316] INFO Kafka commitId: 5e3c2b738d253ff5 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-21 23:01:21,316] INFO Kafka startTimeMs: 1708556481316 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-21 23:01:21,317] INFO [KafkaRaftServer nodeId=2] Kafka Server started (kafka.server.KafkaRaftServer)
[2024-02-21 23:01:21,551] INFO [QuorumController id=2] Replayed BrokerRegistrationChangeRecord modifying the registration for broker 2: BrokerRegistrationChangeRecord(brokerId=2, brokerEpoch=15, fenced=-1, inControlledShutdown=0) (org.apache.kafka.controller.ClusterControlManager)
[2024-02-21 23:05:32,606] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:53242-0) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:33,415] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:53252-0) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:34,223] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45500-0) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:35,839] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45530-1) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:37,050] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45554-1) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:37,453] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45570-1) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:38,312] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45596-2) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:39,523] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45618-2) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2024-02-21 23:05:42,812] INFO [SocketServer listenerType=BROKER, nodeId=2] Failed authentication with /10.244.0.20 (channelId=10.244.0.17:9092-10.244.0.20:45686-2) (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)

Additional information

$ I have no name!@kafka-controller-2:/$ cat /opt/bitnami/kafka/config/server.properties | grep -Ev "^$|^#"
bootstrap.servers=localhost:9092
compression.type=none
I have no name!@kafka-controller-2:/$ ^C
I have no name!@kafka-controller-2:/$ cat /opt/bitnami/kafka/config/server.properties | grep -Ev "^$|^#"
listeners=CLIENT://:9092,INTERNAL://:9094,CONTROLLER://:9093
advertised.listeners=CLIENT://kafka-controller-2.kafka-controller-headless.processing.svc.cluster.local:9092,INTERNAL://kafka-controller-2.kafka-controller-headless.processing.svc.cluster.local:9094
listener.security.protocol.map=CLIENT:SASL_PLAINTEXT,INTERNAL:PLAINTEXT,CONTROLLER:SASL_PLAINTEXT
process.roles=controller,broker
node.id=2
controller.listener.names=CONTROLLER
controller.quorum.voters=0@kafka-controller-0.kafka-controller-headless.processing.svc.cluster.local:9093,1@kafka-controller-1.kafka-controller-headless.processing.svc.cluster.local:9093,2@kafka-controller-2.kafka-controller-headless.processing.svc.cluster.local:9093
sasl.mechanism.controller.protocol=PLAIN
listener.name.controller.sasl.enabled.mechanisms=PLAIN
listener.name.controller.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="controller_user" password="YkB04iZlPP" user_controller_user="YkB04iZlPP";
log.dir=/bitnami/kafka/data
logs.dir=/opt/bitnami/kafka/logs
sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512
inter.broker.listener.name=INTERNAL
listener.name.client.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required user_user1="bdKl9WeN16";
listener.name.client.scram-sha-256.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required;
listener.name.client.scram-sha-512.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required;
$ I have no name!@kafka-controller-2:/$ cat /opt/bitnami/kafka/config/producer.properties | grep -Ev "^$|^#"
bootstrap.servers=localhost:9092
compression.type=none
$ I have no name!@kafka-controller-2:/$ cat /opt/bitnami/kafka/config/consumer.properties | grep -Ev "^$|^#"
bootstrap.servers=localhost:9092
group.id=test-consumer-group
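
The client credentials generated by the chart can also be read back from the release secret (a sketch, assuming the kafka-user-passwords secret name and client-passwords key that recent chart versions create for a release named kafka):

$ kubectl get secret kafka-user-passwords -n processing \
    -o jsonpath='{.data.client-passwords}' | base64 -d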
robertdumitrescu commented 7 months ago

Any update on this?

joancafom commented 7 months ago

Hi @robertdumitrescu

I have tried to reproduce your problem locally, but everything seems to be working fine on my side. I noticed a few small things while reviewing your case that may help you:

  1. Some of the entries in your values.yaml are inaccurate or may not need to be provided:
image:
  debug: true
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
externalAccess:
  enabled: false
  autoDiscovery:
    enabled: true
kraft:
  enabled: true
-  replicas: 3
+# There is no such property as kraft.replicas
+#  replicas: 3
-  zookeeper:
-    enabled: false
+# Be wary of indentations, there is no such property as kraft.zookeeper
+#  zookeeper:
+#    enabled: false
extraEnvVars:
-  - name: KAFKA_ENABLE_KRAFT
-    value: "true"
+# The chart already controls the Kafka Kraft feature through the kraft.enabled param. Additionally, I think this var has been deprecated.
+#  - name: KAFKA_ENABLE_KRAFT
+#    value: "true"
  - name: KAFKA_CFG_DELETE_TOPIC_ENABLE
    value: "true"  # Add this line to enable topic deletion
  - name: KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE
    value: "true"  # Add this line to enable topic auto-creation
-  - name: KAFKA_CFG_SASL_MECHANISM_INTER_BROKER_PROTOCOL
-    value: "PLAIN"
+# Again, the chart already controls this through the listeners.interbroker.protocol param.
+#  - name: KAFKA_CFG_SASL_MECHANISM_INTER_BROKER_PROTOCOL
+#    value: "PLAIN"

controller:
  persistence:
    enabled: true
    size: 8Gi
broker:
  persistence:
    enabled: true
    size: 8Gi
volumePermissions:
  enabled: true
listeners:
  interbroker:
    protocol: PLAINTEXT
serviceAccount:
  create: true
rbac:
  create: true
  2. Make sure to delete PVCs left over from older deployments (warning: make sure to back up any important data you may have on them first). Although this should not be a problem in more recent versions, I'd say that deleting data from previous deployments is still a good practice.
$ helm uninstall kafka
# PVC objects still remain
$  kubectl get pvc
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-kafka-controller-0   Bound    pvc-2c342d7f-c43c-4443-82a5-0480b9f473ce   8Gi        RWO            standard       140m
data-kafka-controller-1   Bound    pvc-fb1d792c-560d-4eba-bc82-ad19d524fb5a   8Gi        RWO            standard       140m
data-kafka-controller-2   Bound    pvc-76f33cda-171b-49d6-8fc8-cf422e2c6c00   8Gi        RWO            standard       140m

$ kubectl delete pvc/data-kafka-controller-0 pvc/data-kafka-controller-1 pvc/data-kafka-controller-2
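
Alternatively, if the PVCs carry the chart's standard labels, they can be deleted in one go with a label selector (a sketch, assuming the release is named kafka in the processing namespace):

$ kubectl delete pvc -l app.kubernetes.io/instance=kafka -n processing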

Having said that, I have used your values on my Minikube cluster and everything seems to run smoothly:

$ k get pods -n processing
NAME                 READY   STATUS    RESTARTS   AGE
kafka-controller-0   1/1     Running   0          122m
kafka-controller-1   1/1     Running   0          122m
kafka-controller-2   1/1     Running   0          122m
robertdumitrescu commented 7 months ago

The pods will show up as Running, but they are not actually establishing connections between themselves. I can try this again tomorrow and provide more logs. Let me know if there is any file or log you would want to see, or if you need any other system configuration.

joancafom commented 7 months ago

Well, it seems to me the cluster is working:

$ kubectl logs kafka-controller-2 -n processing 
[2024-03-08 14:56:26,607] INFO [RaftManager id=2] Completed transition to Leader(localId=2, epoch=13, epochStartOffset=2620, highWatermark=Optional.empty, voterStates={0=ReplicaState(nodeId=0, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=false), 1=ReplicaState(nodeId=1, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=false), 2=ReplicaState(nodeId=2, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true)}) from CandidateState(localId=2, epoch=13, retries=1, voteStates={0=GRANTED, 1=REJECTED, 2=GRANTED}, highWatermark=Optional[LogOffsetMetadata(offset=2619, metadata=Optional.empty)], electionTimeoutMs=1326) (org.apache.kafka.raft.QuorumState)
[2024-03-08 14:56:26,616] INFO [RaftManager id=2] Completed transition to Unattached(epoch=14, voters=[0, 1, 2], electionTimeoutMs=1674) from Leader(localId=2, epoch=13, epochStartOffset=2620, highWatermark=Optional.empty, voterStates={0=ReplicaState(nodeId=0, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=false), 1=ReplicaState(nodeId=1, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=false), 2=ReplicaState(nodeId=2, endOffset=Optional[LogOffsetMetadata(offset=2621, metadata=Optional[(segmentBaseOffset=0,relativePositionInSegment=190780)])], lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true)}) (org.apache.kafka.raft.QuorumState)
[2024-03-08 14:56:26,617] INFO [RaftManager id=2] Vote request VoteRequestData(clusterId='WZW1wBtsAsBuOqw8aLMMWW', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=14, candidateId=1, lastOffsetEpoch=12, lastOffset=2620)])]) with epoch 14 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2024-03-08 14:56:26,617] INFO [QuorumController id=2] In the new epoch 14, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2024-03-08 14:56:26,648] INFO [RaftManager id=2] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=14, leaderId=1, voters=[0, 1, 2], highWatermark=Optional.empty, fetchingSnapshot=Optional.empty) from Unattached(epoch=14, voters=[0, 1, 2], electionTimeoutMs=1674) (org.apache.kafka.raft.QuorumState)
[2024-03-08 14:56:26,648] INFO [QuorumController id=2] In the new epoch 14, the leader is 1. (org.apache.kafka.controller.QuorumController)

I can also produce and consume messages with no issues:

$ kafka-console-consumer.sh --consumer.config /tmp/client.properties --bootstrap-server kafka.processing.svc.cluster.local:9092 --topic test --from-beginning
DOMINO
TESTSTRING
A
THIS OR THAT
^CProcessed a total of 4 messages
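
A client configuration along these lines works against the SASL_PLAINTEXT CLIENT listener (a sketch; user1 and its password are the chart-generated defaults, so substitute the values from your own kafka-user-passwords secret):

security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
    username="user1" \
    password="<client password>";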
robertdumitrescu commented 7 months ago

I literally uninstalled Minikube and Docker, reinstalled them both, spun up a new Minikube cluster, and installed only Kafka in isolation with your suggested changes, and I am getting exactly the same error. Identical. I have absolutely no clue what I am doing wrong. Do you have any steps I could follow to debug this?

robertdumitrescu commented 7 months ago

Also, it seems like a lot of other people are reporting the same problem, as highlighted in the linked issues.

chanhosongterry commented 6 months ago

When I first provisioned the cluster, there were no issues. However, when I created a new cluster after that, a problem occurred. After spending a few hours, I deleted all PVCs and created the cluster again, and it worked normally.

joancafom commented 6 months ago

I'm really sorry the issue persists, but I cannot reproduce it on my side. I would recommend you write down the precise steps you follow and share them in this thread, just in case we can find something odd.

Additionally, please make sure PVCs are deleted as suggested.

robertdumitrescu commented 6 months ago

Okay, I will try to reproduce everything tomorrow; I might do a screen capture. I already tried the suggested solution of deleting everything (PVs/PVCs)... I even tried on a fully new cluster and I still have the same problem.

github-actions[bot] commented 6 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 5 months ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

MuhammadJamee commented 4 months ago

Did you find a solution?

robertdumitrescu commented 4 months ago

No, I tried 2 weeks ago and I still got the same behaviour. I moved to Pulsar.