Closed by eroji 3 days ago
Hi @eroji,
I just tried to reproduce the issue using the latest Bitnami Kafka chart and everything worked as expected. I applied the same changes you mentioned to the values.yaml file:
```diff
diff --git a/bitnami/kafka/values.yaml b/bitnami/kafka/values.yaml
index f2c92ee02f..64872c7d64 100644
--- a/bitnami/kafka/values.yaml
+++ b/bitnami/kafka/values.yaml
@@ -164,7 +164,7 @@ listeners:
   ## @param listeners.client.sslClientAuth Optional. If SASL_SSL is enabled, configure mTLS TLS authentication type. If SSL protocol is enabled, overrides tls.authType for this listener. Allowed values are 'none', 'requested' and 'required'
   client:
     containerPort: 9092
-    protocol: SASL_PLAINTEXT
+    protocol: SASL_SSL
     name: CLIENT
     sslClientAuth: ""
   ## @param listeners.controller.name Name for the Kafka controller listener
@@ -264,8 +264,8 @@ sasl:
   ##
   client:
     users:
-      - user1
-    passwords: ""
+      - admin
+    passwords: "somepassword"
   ## Credentials for Zookeeper communications.
   ## @param sasl.zookeeper.user Username for zookeeper communications when SASL is enabled.
   ## @param sasl.zookeeper.password Password for zookeeper communications when SASL is enabled.
@@ -320,7 +320,7 @@ tls:
   ## @param tls.autoGenerated Generate automatically self-signed TLS certificates for Kafka brokers. Currently only supported if `tls.type` is `PEM`
   ## Note: ignored when using 'jks' format or `tls.existingSecret` is not empty
   ##
-  autoGenerated: false
+  autoGenerated: true
   ## @param tls.customAltNames Optionally specify extra list of additional subject alternative names (SANs) for the automatically generated TLS certificates.
   ##
   customAltNames: []
@@ -1405,7 +1405,7 @@ broker:
 service:
   ## @param service.type Kubernetes Service type
   ##
-  type: ClusterIP
+  type: LoadBalancer
   ## @param service.ports.client Kafka svc port for client connections
   ## @param service.ports.controller Kafka svc port for controller connections. It is used if "kraft.enabled: true"
   ## @param service.ports.interbroker Kafka svc port for inter-broker connections
```
and deployed the solution. Pods were running for 10+ minutes without problems. Please debug the issue in your cluster and ensure you are not running into a performance issue:
```
NAME                 READY   STATUS    RESTARTS   AGE
kafka-controller-0   1/1     Running   0          11m
kafka-controller-1   1/1     Running   0          11m
kafka-controller-2   1/1     Running   0          11m
```
@jotamartos I can confirm this is also happening on ARM-based clusters. Our cluster is in AWS using their ARM-based machines.
We use the helm chart oci://registry-1.docker.io/bitnamicharts/kafka with the following values:
```yaml
extraConfigYaml:
  "authorizer.class.name": "org.apache.kafka.metadata.authorizer.StandardAuthorizer"
  "super.users": "User:controller_user"
listeners:
  client:
    containerPort: 9092
    protocol: SASL_SSL
    name: CLIENT
  controller:
    name: CONTROLLER
    containerPort: 9093
    protocol: SASL_SSL
    sslClientAuth: "required"
sasl:
  enabledMechanisms: PLAIN #,SCRAM-SHA-256,SCRAM-SHA-512
tls:
  type: PEM
  existingSecret: <censored>
  keystorePassword: <censored>
  truststorePassword: <censored>
```
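For reference, a client connecting to the SASL_SSL listener above would need properties along these lines. This is a minimal sketch; the username, password, and truststore path are placeholders, not values from this issue:

```properties
# Hypothetical client.properties for a SASL_SSL listener with the PLAIN mechanism.
# username, password, and truststore path are placeholders.
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
    username="user1" \
    password="changeme";
ssl.truststore.location=/path/to/truststore.pem
ssl.truststore.type=PEM
```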
The cluster comes up and works well, but over time each controller gets OOMKilled after several minutes.
Note that during this time the cluster isn't under load at all; we don't yet have any apps sending messages to it.
The OOM happens only after several minutes: while the cluster seemingly idles, something soaks up memory.
Hi,
I tried to reproduce the issue again in an ARM-based cluster (the one Docker Desktop provides on an M1-based macOS machine) and didn't get any error:
```
$ k get pods
NAME                 READY   STATUS    RESTARTS   AGE
kafka-controller-0   1/1     Running   0          49m
kafka-controller-1   1/1     Running   0          49m
kafka-controller-2   1/1     Running   0          49m
```
I applied the same values file I used in my previous message. In that case, the tests were executed in an x64-based cluster in GKE.
Please note that you can set a different resourcesPreset configuration for your deployment and see if that solves the issue in your cluster:
```yaml
## @param controller.resourcesPreset Set container resources according to one common preset (allowed values: none, nano, micro, small, medium, large, xlarge, 2xlarge). This is ignored if controller.resources is set (controller.resources is recommended for production).
## More information: https://github.com/bitnami/charts/blob/main/bitnami/common/templates/_resources.tpl#L15
##
resourcesPreset: "small"
```
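Alternatively, as the comment above notes, explicit `controller.resources` take precedence over any preset. A hedged sketch of a values override; the numbers below are illustrative, not recommendations from this thread:

```yaml
controller:
  resourcesPreset: "none"   # ignored anyway once resources is set
  resources:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 1536Mi
```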
We have also made some progress; we set heapOpts in the helm chart to:

```
-Xmx512m -Xms512m -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+ExplicitGCInvokesConcurrent
```

This mostly stopped the issue for us, but I'm unsure how good these options are.
We are still seeing some pods being killed, but as far as I can see it is no longer due to K8s OOMKilled. Could it be that the JVM, with the reduced heap space configured above, decides to "give up" at some point, which would yield issues? I will try increasing the opts and see if that fixes the issue in the coming days.
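As a rough sanity check on those flags: the JVM's total footprint is heap plus metaspace plus off-heap overhead. The overhead figure below is an assumption for illustration, not a measured value from this thread:

```python
# Back-of-the-envelope JVM footprint for the heapOpts quoted above.
heap_mib = 512       # -Xmx512m / -Xms512m
metaspace_mib = 96   # -XX:MetaspaceSize=96m (initial size; metaspace can still grow)
overhead_mib = 150   # assumed: thread stacks, code cache, direct buffers, GC structures
total_mib = heap_mib + metaspace_mib + overhead_mib
print(total_mib)  # 758 -- close to a 768 Mi container limit, so little headroom
```

If the true overhead is larger than assumed, the container limit is still breached even with the smaller heap, which would match the occasional kills described above.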
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Turns out the above heapOpts worked for us; we faced some pod terminations due to our k8s cluster scaling, which isn't related to Kafka at all (or this helm chart).
As such I would suggest tweaking the defaults in the helm chart to help with OOM, but other than that we seem to be OK.
Would you like to contribute and improve the Chart? You can follow our contributing guidelines and the team will be more than happy to review the changes.
I have recently contributed in the form of https://github.com/bitnami/charts/pull/27877 but have not yet received any feedback; I could consider helping here once that first PR is through.
Thanks! I can see the PR you mentioned is now merged 😄
Please do not hesitate to suggest any change in the solution.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Name and Version
bitnami/kafka:9.6.1
What architecture are you using?
amd64
What steps will reproduce the bug?
Deploy a 3-replica StatefulSet cluster via the Helm chart
Are you using any custom parameters or values?
What is the expected behavior?
The default heap size appears to be 1 GB but the memory limit is 768Mi. The cluster is not actually being used yet, so it should not run out of memory on its own. If, however, Kafka requires more memory by default in KRaft mode, then the Helm chart's default should reflect that.
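The mismatch can be stated in one line of arithmetic, using the numbers as reported:

```python
heap_mib = 1024    # default maximum JVM heap, -Xmx1G, as reported
limit_mib = 768    # container memory limit (768Mi), as reported
# The heap alone exceeds the container limit before any off-heap memory
# (metaspace, thread stacks, direct buffers) is counted, so even an idle
# broker can be OOMKilled once the heap fills up.
print(heap_mib > limit_mib)  # True
```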
What do you see instead?
The pods are being killed due to failed liveness/readiness check.
The final entries in the log, which seem to be consistent each time it is killed
The cause for Kubernetes killing the pod
Additional information
No response