danielqsj / kafka_exporter

Kafka exporter for Prometheus
Apache License 2.0

Metric error: collected metric was collected before with the same name and label values #306

Open faabsen opened 2 years ago

faabsen commented 2 years ago

When using the exporter (version: danielqsj/kafka-exporter:v1.4.2), we sometimes run into the following error and no metrics are displayed:

An error has occurred while serving metrics:
collected metric "kafka_consumergroup_members" { label:<name:"consumergroup" value:"<NAME>" > gauge:<value:0 > } was collected before with the same name and label values

This does not happen with the other "exporter" from Yahoo (https://github.com/yahoo/CMAK).

After manually reassigning the __consumer_offsets topic in Kafka, the exporter starts collecting the metrics correctly again. Has anyone experienced similar behaviour before?
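For context (this is an observation about the Prometheus Go client, not about where the duplicates come from): the registry in prometheus/client_golang fails the whole scrape when a collector emits two samples with the same metric name and label values in one collection cycle. The sketch below is not kafka_exporter's code; it is a minimal hypothetical collector that reproduces the same error by reporting one consumer group twice, which is presumably what effectively happens when the exporter ends up collecting the same group more than once.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dupCollector deliberately emits the same metric (identical name and label
// values) twice per scrape. The group name and help text are made up; the
// metric name and label mirror the error above.
type dupCollector struct {
	desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ { // two samples with the same label set -> scrape error
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "my-group")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{
		desc: prometheus.NewDesc(
			"kafka_consumergroup_members",
			"Number of members in the consumer group (illustrative help text)",
			[]string{"consumergroup"}, nil,
		),
	})
	// Scraping http://localhost:9308/metrics should now fail with an error like:
	//   An error has occurred while serving metrics:
	//   collected metric "kafka_consumergroup_members" ... was collected before
	//   with the same name and label values
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9308", nil))
}
```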

yinyu985 commented 2 years ago
An error has occurred while serving metrics:

43 error(s) occurred:
* collected metric "kafka_consumergroup_members" { label:<name:"consumergroup" value:"winlogbeat_printer" > gauge:<value:2 > } was collected before with the same name and label values
* collected metric "kafka_consumergroup_current_offset" { label:<name:"consumergroup" value:"winlogbeat_printer" > label:<name:"partition" value:"0" > label:<name:"topic" value:"winlogbeat_printer" > gauge:<value:1617 > } was collected before with the same name and label values
* collected metric "kafka_consumergroup_lag" { label:<name:"consumergroup" value:"winlogbeat_printer" > label:<name:"partition" value:"0" > label:<name:"topic" value:"winlogbeat_printer" > gauge:<value:0 > } was collected before with the same name and label values

I have the same problem as you

rmrf commented 2 years ago

I have the same error, but reassigning the __consumer_offsets topic didn't help.

VolcanicSnow commented 2 years ago

I have the same error:

An error has occurred while serving metrics:

1004 error(s) occurred:

alexinthesky commented 1 year ago

Same issue here. We are connecting to Azure Event Hubs and I'm noticing this odd behavior in the logs:

[sarama] 2022/10/03 13:30:10 client/brokers registered new broker #0 at ehn-central.servicebus.windows.net:9093
[sarama] 2022/10/03 13:30:10 client/brokers registered new broker #1 at Ehn-central.servicebus.windows.net:9093
[sarama] 2022/10/03 13:30:10 client/brokers registered new broker #2 at EHn-central.servicebus.windows.net:9093
[sarama] 2022/10/03 13:30:10 client/brokers registered new broker #3 at EHN-central.servicebus.windows.net:9093

I'm specifying only one server, yet I see four of these lines with different capitalization of the (same) broker name.

lhaussknecht commented 1 year ago

We have the same issue here, after a lot of rebalancing during the night. We are also an Event Hubs user, and we see the same casing symptom.

[sarama] 2022/10/06 06:04:53 Connected to broker at digizxxxxxxxxxxx.servicebus.windows.net:9093 (registered as #0)
[sarama] 2022/10/06 06:04:55 Connected to broker at Digizxxxxxxxxxxx.servicebus.windows.net:9093 (registered as #1)
[sarama] 2022/10/06 06:04:55 Connected to broker at DIgizxxxxxxxxxxx.servicebus.windows.net:9093 (registered as #2)
[sarama] 2022/10/06 06:04:56 Connected to broker at DIGizxxxxxxxxxxx.servicebus.windows.net:9093 (registered as #3)
[sarama] 2022/10/06 06:04:57 Connected to broker at DIGIzxxxxxxxxxxx.servicebus.windows.net:9093 (registered as #4)
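For what it's worth, those five addresses differ only in capitalization, and DNS host names are case-insensitive, so they all point at the same Event Hubs endpoint even though sarama registers them as five separate brokers. A tiny sketch of that observation (not code from kafka_exporter or sarama; the host name below is a placeholder):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Broker addresses as advertised in the metadata response (placeholder host).
	advertised := []string{
		"eventhub-ns.servicebus.windows.net:9093",
		"Eventhub-ns.servicebus.windows.net:9093",
		"EVenthub-ns.servicebus.windows.net:9093",
		"EVEnthub-ns.servicebus.windows.net:9093",
		"EVENthub-ns.servicebus.windows.net:9093",
	}

	// Host names are case-insensitive, so lowercasing collapses them to one entry.
	unique := make(map[string]struct{})
	for _, addr := range advertised {
		unique[strings.ToLower(addr)] = struct{}{}
	}

	fmt.Printf("%d advertised brokers, %d distinct endpoint(s)\n", len(advertised), len(unique))
	// Output: 5 advertised brokers, 1 distinct endpoint(s)
}
```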

alexinthesky commented 1 year ago

Hey, happy to hear we're not alone in this. Did you get things sorted out? We have an open ticket with Azure. It's been a very long time. They are not sharing their changelog or deployments, but I definitely believe it's related to changes in their broker load-balancing system.

mshekharee commented 1 year ago

We are also seeing the same issue, especially after upgrading AKS from 1.21 to 1.23. Is there any update on the solution?

misitechen commented 1 year ago

Did you get things sorted out?

mshekharee commented 1 year ago

@misitechen

This was sorted out by raising a support case with Microsoft. Below is the cause they gave.

Root Cause: As part of a recent upgrade, a change was made in the service (the Kafka request handler) so that it returns a list of virtual brokers (16 brokers) in the metadata response, allowing client applications to create and manage multiple TCP connections to a topic for better performance. However, the change affects the produce API when connections are not fully utilized and become idle due to inactivity. In that case, a producer application can hit a request timeout if it sends a message over a connection that was already terminated for idleness, resulting in a retry.

Resolution: As part of the mitigation, the change has been reverted and the virtual broker host returns a single address again.

lhaussknecht commented 1 year ago

@mshekharee Very interesting! Would you mind sharing a date range, when Microsoft applied the update and reverted it?

mshekharee commented 1 year ago

@lhaussknecht Looks like Microsoft is handling this on an account-by-account basis. The change for our account was reverted about a month ago.

KD0735 commented 1 year ago

Same question. Has anyone found a way to deal with it?

lhaussknecht commented 1 year ago

We worked around this by adding --group.filter='.+' to the argument list.
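In case it helps anyone else, the flag goes on the exporter's command line (or the container args) next to the usual connection flags, for example (the broker address below is a placeholder):

kafka_exporter --kafka.server=my-broker:9092 --group.filter='.+'

--group.filter takes a regular expression matched against consumer group names; '.+' matches every non-empty name, so it doesn't actually exclude any groups, which suggests the workaround changes how the exporter enumerates groups rather than what it collects.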

davidpechcz commented 1 year ago

We had a similar issue with KafkaExporter and managed Kafka on Oracle Cloud (OCI Streams) - solved with

--group.filter='.+'

xiangrm commented 4 months ago

However, adding the '--group.filter' parameter causes the consumer metrics not to be collected.