elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Handle duplicated TYPE line for prometheus metrics #18813

Closed: crisdarocha closed this issue 1 year ago

crisdarocha commented 4 years ago

Describe the enhancement: Opening this issue for an enhancement on behalf of a user.

They are collecting MicroProfile Metrics from Payara in JSON format.

- module: openmetrics
  metricsets: ['collector']
  period: 10s
  hosts: ['localhost:8080']

  # This module uses the Prometheus collector metricset, all
  # the options for this metricset are also available here.
  metrics_path: /metrics/
  metrics_filters:
    include: []
    exclude: []

Unfortunately, Payara versions 5.193.1, 5.194, and 5.201 have a bug in their MicroProfile Metrics implementation, and the output contains repeated TYPE lines:

# TYPE base_gc_total_total counter
# HELP base_gc_total_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
base_gc_total_total{name="PS MarkSweep"} 4
...
# TYPE base_gc_total_total counter
# HELP base_gc_total_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
base_gc_total_total{name="PS Scavenge"} 34

This violates the Prometheus text format standard, and Metricbeat yields an error:

  "error": {
    "message": "unable to decode response from prometheus endpoint: decoding of metric family failed: text format parsing error in line 43: second TYPE line for metric name \"base_gc_total_total\", or TYPE reported after samples"
  },

The request is for Metricbeat to ignore duplicate (identical) TYPE lines (demoting the error to a warning) and process the data nevertheless.
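
A minimal sketch of what this could look like in Go (dedupeTypeLines is a hypothetical helper, not existing Metricbeat code): drop byte-identical # TYPE / # HELP comment lines before handing the body to the strict parser. This is only safe when the repeated lines agree exactly; a second TYPE line that disagrees should still be an error.

package main

import (
    "bufio"
    "fmt"
    "strings"
)

// dedupeTypeLines is a hypothetical pre-processing step: it drops
// repeated, byte-identical "# TYPE" and "# HELP" comment lines so that
// a strict text-format parser no longer rejects the payload.
func dedupeTypeLines(body string) string {
    seen := make(map[string]bool)
    var out strings.Builder
    scanner := bufio.NewScanner(strings.NewReader(body))
    for scanner.Scan() {
        line := scanner.Text()
        if strings.HasPrefix(line, "# TYPE") || strings.HasPrefix(line, "# HELP") {
            if seen[line] {
                continue // identical metadata line already emitted
            }
            seen[line] = true
        }
        out.WriteString(line)
        out.WriteByte('\n')
    }
    return out.String()
}

func main() {
    payload := `# TYPE base_gc_total_total counter
base_gc_total_total{name="PS MarkSweep"} 4
# TYPE base_gc_total_total counter
base_gc_total_total{name="PS Scavenge"} 34
`
    fmt.Print(dedupeTypeLines(payload))
}

After deduplication, the Payara payload above parses as a single metric family with two samples.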

This bug is fixed in Payara 5.202RC1, but the upgrade is complex and lengthy due to the scale of usage.

Describe a specific use case for the enhancement or feature: Allow users on "bugged" versions of Payara to still use Metricbeat.

As per a private discussion with @exekias and @sorantis. Opening the case to keep a record of the demand.

elasticmachine commented 4 years ago

Pinging @elastic/integrations-platforms (Team:Platforms)

exekias commented 4 years ago

I think the underlying problem is that we use a different lib to parse metrics than Prometheus does; this seems to cause some unexpected behaviors when the source data doesn't strictly follow the format.

We may want to investigate a way to use the same code paths that Prometheus is using to collect metrics.
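
For reference, a minimal reproduction of the failure, assuming the parser in play is expfmt.TextParser from github.com/prometheus/common (the error string it produces matches the one reported above):

package main

import (
    "fmt"
    "strings"

    "github.com/prometheus/common/expfmt"
)

func main() {
    // Trimmed-down version of the Payara payload from the issue description.
    payload := `# TYPE base_gc_total_total counter
base_gc_total_total{name="PS MarkSweep"} 4
# TYPE base_gc_total_total counter
base_gc_total_total{name="PS Scavenge"} 34
`
    var parser expfmt.TextParser
    _, err := parser.TextToMetricFamilies(strings.NewReader(payload))
    fmt.Println(err)
    // text format parsing error in line 3: second TYPE line for
    // metric name "base_gc_total_total", or TYPE reported after samples
}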

@ChrsMark I'm not sure this one is trivial, what's the approach you had in mind?

ChrsMark commented 4 years ago

Hmm, yeah, it might not be so easy. The code cannot even "unpack" the response, right? I had assumed the error occurs after the response is unpacked, at a point where it could be post-processed to fix this kind of issue.

hgruck commented 3 years ago

Is there any plan to fix this?

ChrsMark commented 3 years ago

Hey, we plan to move to an improved parsing library, so this might fix this one too: https://github.com/elastic/beats/issues/24707

xuoguoto commented 3 years ago

I too am getting this error on metricbeat version 7.13.1 (amd64), libbeat 7.13.1 [2d80f6e99f41b65a270d61706fa98d13cfbda18d]

module/wrapper.go:259 Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: decoding of metric family failed: text format parsing error in line 45: second TYPE line for metric name "_err_null_node_blackholed_packets", or TYPE reported after samples

ChrsMark commented 3 years ago

@xuoguoto do you have a case similar to the one described in this issue's description? If so, I'm afraid there is no quick fix at the moment, since this violates the Prometheus standard. As mentioned in a previous comment, these kinds of issues might be resolved when/if we finally move to a new parsing library (#24707).

xuoguoto commented 3 years ago

@ChrsMark From the exporter, here is what I see when grepping for _err_null_node_blackholed_packets:

# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="0"} 0
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="1"} 250319
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="2"} 1
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="3"} 140111
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="4"} 0
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="5"} 0
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="6"} 0
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="7"} 0
# TYPE _err_null_node_blackholed_packets counter
_err_null_node_blackholed_packets{thread="8"} 0

Is this a problem?

hamelg commented 3 years ago

Here we hit this issue too, but with a slight variation.

unable to decode response from prometheus endpoint: decoding of metric family failed: text format parsing error in line 58: second TYPE line for metric name "jvm_classes_loaded", or TYPE reported after samples

sh-4.2# curl -s http://10.1.86.129:9779/metrics|cat -n |grep jvm_classes_loaded
   54  # HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
   55  # TYPE jvm_classes_loaded gauge
   56  jvm_classes_loaded 28959.0
   57  # HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
   58  # TYPE jvm_classes_loaded_total counter
   59  jvm_classes_loaded_total 29166.0

peterschrott commented 2 years ago

@hamelg, I encountered the same issue as you. Metrics are exposed via the Prometheus JMX Exporter. The weird thing is that Metricbeat behaves differently with different versions of the JMX Exporter.

With JMX Exporter v0.14.0 everything works as expected and metrics are exported; with v0.16.1 I get the following error:

2022-04-05T17:33:07.769+0200    INFO    module/wrapper.go:259   Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: decoding of metric family failed: text format parsing error in line 4: second TYPE line for metric name "jvm_classes_loaded", or TYPE reported after samples

Output with JMX Exporter v0.14.0:

# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 39039.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
jvm_classes_loaded_total 39481.0

Output with JMX Exporter v0.16.1:

# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 18998.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
jvm_classes_loaded_total 18998.0

ChrsMark commented 1 year ago

Hey @peterschrott! Could you also share the returned headers in both cases when you curl the endpoints?
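
A minimal sketch of that check in Go, for anyone following along: it fetches the endpoint from the earlier comment once per Accept header and prints the negotiated Content-Type plus the first few body lines. The idea that content negotiation explains the difference between exporter versions is a hypothesis at this point, not a confirmed diagnosis.

package main

import (
    "bufio"
    "fmt"
    "net/http"
)

func main() {
    // Endpoint taken from the comment above; adjust as needed.
    const url = "http://10.1.86.129:9779/metrics"
    for _, accept := range []string{
        "text/plain; version=0.0.4",                   // classic Prometheus text format
        "application/openmetrics-text; version=0.0.1", // OpenMetrics format
    } {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            panic(err)
        }
        req.Header.Set("Accept", accept)
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        fmt.Println("Accept:      ", accept)
        fmt.Println("Content-Type:", resp.Header.Get("Content-Type"))
        // Print only the first few lines of the body for comparison.
        sc := bufio.NewScanner(resp.Body)
        for i := 0; i < 5 && sc.Scan(); i++ {
            fmt.Println(sc.Text())
        }
        resp.Body.Close()
        fmt.Println()
    }
}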

ChrsMark commented 1 year ago

A quick heads-up on this.

A Prometheus server is able to scrape metrics from an endpoint that exposes duplicated metrics; in that case both metrics are collected without an issue. I verified that the case reported in the issue's description is handled without an issue by a Prometheus server.

So for an endpoint exposing the following:

# TYPE base_gc_total_total counter
# HELP base_gc_total_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
base_gc_total_total{name="PS MarkSweep"} 4
# TYPE base_gc_total_total counter
# HELP base_gc_total_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
base_gc_total_total{name="PS Scavenge"} 34

The Prometheus server will collect both metrics, for example:

base_gc_total_total{instance="containerd:1338", job="duplicate-types", name="PS MarkSweep"}  4
base_gc_total_total{instance="containerd:1338", job="duplicate-types", name="PS Scavenge"} 34
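
For anyone who wants to reproduce this, a minimal prometheus.yml sketch; the job name and target are inferred from the instance/job labels in the samples above:

scrape_configs:
  - job_name: 'duplicate-types'
    static_configs:
      - targets: ['containerd:1338']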

So in that case, with the current Metricbeat module, we are not able to provide the same experience. The upgrade of the parsing library in https://github.com/elastic/beats/pull/33865 will solve this issue.

As far as the Java client exporters are concerned, I cannot say for sure what the issue was, but I suspect it has to do with https://github.com/prometheus/client_java/releases/tag/parent-0.10.0 or something similar, as reported at https://github.com/elastic/beats/issues/24554. In such cases the headers need to be verified, and if the endpoint serves OpenMetrics, users are advised to use the openmetrics module introduced in https://github.com/elastic/beats/pull/27269.