Our Kafka cluster has around 10 brokers, a few topics, and only a few partitions. Consumer groups number in the thousands.
We have deployed Kafka using the Strimzi Kafka Operator Helm chart (0.34.0).
This comes equipped with a kafka-exporter instance.
When we try to scrape the exposed endpoint, the following events are logged in the pod description:
The pods try to consume absurd amounts of memory, which causes all of them to be evicted.
The following is our Kafka CRD (again, using Strimzi operator):
We haven't really set any custom configuration, but I do not see how this level of memory consumption should even be possible.
Any hints as to where the potential memory leak could be / where we might have set it up wrong?
FYI, we did add generous amounts of resources to the Kafka CRD.
EDIT:
I have identified an extremely large number of consumer group offsets stored in the cluster and therefore processed by the exporter. We have spun consumers up and down with unique group IDs many times, creating millions of them over time.
I deleted all inactive consumer groups, as they congested the exporter, by executing the following script inside a broker:
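A minimal sketch of such a cleanup, assuming the stock `kafka-consumer-groups.sh` CLI is available inside the broker container. The path `/opt/kafka/bin`, the bootstrap address `localhost:9092`, the helper names `filter_empty_groups` and `delete_groups`, and the `DRY_RUN` switch are all assumptions for illustration, not taken from the script that was actually run. "Inactive" is interpreted here as groups in the `Empty` state.

```shell
#!/usr/bin/env bash
# Sketch: delete consumer groups that currently have no active members.
# Assumed (not from the original post): CLI location and listener address.
KAFKA_BIN="${KAFKA_BIN:-/opt/kafka/bin}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Reads the output of `kafka-consumer-groups.sh --list --state` on stdin and
# prints the names of groups whose last column (the state) is "Empty".
# The header row is skipped because its last column is "STATE".
filter_empty_groups() {
  awk 'NF >= 2 && $NF == "Empty" { print $1 }'
}

# Reads group names on stdin, one per line, and deletes each group.
# With DRY_RUN=1 (the default) it only prints what it would delete.
delete_groups() {
  while IFS= read -r group; do
    [ -n "$group" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would delete: $group"
    else
      # stdin is redirected so the CLI cannot swallow the remaining names
      "$KAFKA_BIN/kafka-consumer-groups.sh" --bootstrap-server "$BOOTSTRAP" \
        --delete --group "$group" < /dev/null
    fi
  done
}
```

To use it, pipe the group listing through both helpers: `"$KAFKA_BIN/kafka-consumer-groups.sh" --bootstrap-server "$BOOTSTRAP" --list --state | filter_empty_groups | delete_groups`. Review the dry-run output first, then set `DRY_RUN=0` to actually delete. With millions of groups, one CLI invocation per group will be slow, so it is probably worth running in batches.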
This issue will be closed, as it was in fact just an absurd amount of data. We could of course also configure the exporter not to scrape the offsets at all.