fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Prometheus Metric Buckets Exported Incorrectly #8919

Closed sdmichelini closed 3 weeks ago

sdmichelini commented 5 months ago

Bug Report

Describe the bug

When using Prometheus as a source and exporting it, the buckets on the histogram get mangled. In the example below, new buckets were added and counts were dropped for some of the le values in the histogram. Because those counts are missing, the values rendered by the histogram_quantile function in Grafana are incorrect (an example query is sketched after the output below).

To Reproduce

Example Input

my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="0"} 0
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="0.001"} 4
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="0.01"} 4
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="0.05"} 4
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="0.1"} 4
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="0.5"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="1"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="2"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="3"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="5"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="7"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="10"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="15"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="20"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="25"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="30"} 5
my_tracing_duration_seconds_bucket{operation="my_operation",result="success",le="+Inf"} 5
my_tracing_duration_seconds_sum{operation="my_operation",result="success"} 0.296702106
my_tracing_duration_seconds_count{operation="my_operation",result="success"} 5

Example Output

my_tracing_duration_seconds_bucket{le="0.0",operation="my_operation",result="success"} 0
my_tracing_duration_seconds_bucket{le="0.1",operation="my_operation",result="success"} 4
my_tracing_duration_seconds_bucket{le="1.0",operation="my_operation",result="success"} 4
my_tracing_duration_seconds_bucket{le="2.0",operation="my_operation",result="success"} 4
my_tracing_duration_seconds_bucket{le="3.0",operation="my_operation",result="success"} 4
my_tracing_duration_seconds_bucket{le="4.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="5.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="6.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="7.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="8.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="9.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="10.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="12.5",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="15.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="17.5",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="20.0",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="22.5",operation="my_operation",result="success"} 5
my_tracing_duration_seconds_bucket{le="25.0",operation="my_operation",result="success"} 0
my_tracing_duration_seconds_bucket{le="27.5",operation="my_operation",result="success"} 0
my_tracing_duration_seconds_bucket{le="30.0",operation="my_operation",result="success"} 0
my_tracing_duration_seconds_bucket{le="+Inf",operation="my_operation",result="success"} 1
my_tracing_duration_seconds_sum{operation="my_operation",result="success"} 0.29670210600000002
my_tracing_duration_seconds_count{operation="my_operation",result="success"} 5
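In a Grafana dashboard, such a series is typically queried with histogram_quantile; the exact dashboard query is not part of this report, so the one below is only an illustrative sketch. With the shifted bucket boundaries and the +Inf count dropping from 5 to 1, a query like this yields wrong quantiles.

# Hypothetical p95 query over the exported series shown above.
histogram_quantile(
  0.95,
  sum by (le) (
    rate(my_tracing_duration_seconds_bucket{operation="my_operation", result="success"}[5m])
  )
)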

Expected behavior

Your Environment

cosmo0920 commented 4 months ago

I'm trying to reproduce this with a minimal case. I created a minimal tool that produces the histogram:

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    my_duration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "my_tracing_duration_seconds",
        Help:    "tracing duration in seconds",
        Buckets: []float64{.0, .001, .01, .05, .1, .5, 1, 2, 3, 5, 7, 10, 15, 20, 25, 30},
    })
)

func init() {
    prometheus.MustRegister(my_duration)
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    go http.ListenAndServe(":8080", nil)
    simulate()
}

func simulate() {
    my_duration.Observe(0.001)
    my_duration.Observe(0.001)
    my_duration.Observe(0.001)
    my_duration.Observe(0.001)
    my_duration.Observe(0.292702106)
    for {
        time.Sleep(1 * time.Second)
    }
}

Then I compiled it.
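A rough way to build and sanity-check the tool, assuming the source is saved as main.go in its own module (the module name below is just an example):

$ go mod init histsim && go mod tidy
$ go run . &
$ curl -s http://localhost:8080/metrics | grep my_tracing_duration_seconds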

Then I used prometheus_scrape to scrape the served non-standard bucket layout and specified prometheus_exporter as the output.
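Something along these lines; the ports and scrape interval are not given in the comment, so this is only an assumed sketch of the Fluent Bit pipeline:

[INPUT]
    name             prometheus_scrape
    host             127.0.0.1
    port             8080
    metrics_path     /metrics
    scrape_interval  10s

[OUTPUT]
    name   prometheus_exporter
    match  *
    host   0.0.0.0
    port   2021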

However, I obtained a non-broken result in this case:

# HELP my_tracing_duration_seconds tracing duration in seconds
# TYPE my_tracing_duration_seconds histogram
my_tracing_duration_seconds_bucket{le="0.0"} 0
my_tracing_duration_seconds_bucket{le="0.001"} 4
my_tracing_duration_seconds_bucket{le="0.01"} 4
my_tracing_duration_seconds_bucket{le="0.05"} 4
my_tracing_duration_seconds_bucket{le="0.1"} 4
my_tracing_duration_seconds_bucket{le="0.5"} 5
my_tracing_duration_seconds_bucket{le="1.0"} 5
my_tracing_duration_seconds_bucket{le="2.0"} 5
my_tracing_duration_seconds_bucket{le="3.0"} 5
my_tracing_duration_seconds_bucket{le="5.0"} 5
my_tracing_duration_seconds_bucket{le="7.0"} 5
my_tracing_duration_seconds_bucket{le="10.0"} 5
my_tracing_duration_seconds_bucket{le="15.0"} 5
my_tracing_duration_seconds_bucket{le="20.0"} 5
my_tracing_duration_seconds_bucket{le="25.0"} 5
my_tracing_duration_seconds_bucket{le="30.0"} 5
my_tracing_duration_seconds_bucket{le="+Inf"} 5
my_tracing_duration_seconds_sum 0.29670210600000002
my_tracing_duration_seconds_count 5

cosmo0920 commented 4 months ago

@sdmichelini Are there any further requirements to break the consistency of the histogram? Is passing it through an instance of Prometheus needed? How did you create the broken histogram that uses custom buckets? Is it just collected from node_exporter, or from a custom client exposing a Prometheus endpoint?

sdmichelini commented 4 months ago

All I did was expose a Prometheus histogram with the buckets above as an input, and I got the output shown above.

cosmo0920 commented 4 months ago

I tried to use this Prometheus text format file:

% cat problematic_prom/histgram.prom                                                                             [Fail]
# HELP my_tracing_duration_seconds tracing duration in seconds
# TYPE my_tracing_duration_seconds histogram
my_tracing_duration_seconds_bucket{le="0"} 0
my_tracing_duration_seconds_bucket{le="0.001"} 4
my_tracing_duration_seconds_bucket{le="0.01"} 4
my_tracing_duration_seconds_bucket{le="0.05"} 4
my_tracing_duration_seconds_bucket{le="0.1"} 4
my_tracing_duration_seconds_bucket{le="0.5"} 5
my_tracing_duration_seconds_bucket{le="1"} 5
my_tracing_duration_seconds_bucket{le="2"} 5
my_tracing_duration_seconds_bucket{le="3"} 5
my_tracing_duration_seconds_bucket{le="5"} 5
my_tracing_duration_seconds_bucket{le="7"} 5
my_tracing_duration_seconds_bucket{le="10"} 5
my_tracing_duration_seconds_bucket{le="15"} 5
my_tracing_duration_seconds_bucket{le="20"} 5
my_tracing_duration_seconds_bucket{le="25"} 5
my_tracing_duration_seconds_bucket{le="30"} 5
my_tracing_duration_seconds_bucket{le="+Inf"} 5
my_tracing_duration_seconds_sum 0.296702106
my_tracing_duration_seconds_count 5

And ingested it with node_exporter's textfile collector:

$ ./node_exporter --collector.textfile.directory=problematic_prom  
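Prometheus then scraped that node_exporter instance; the scrape configuration is not shown in the comment, but a minimal prometheus.yml matching the job and instance labels in the query result below would look roughly like this:

scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets: ['localhost:9100']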

Also, the metrics stored inside Prometheus itself do not have broken buckets or values:

% curl 'http://localhost:9090/api/v1/query_range?query=my_tracing_duration_seconds_bucket&start=2024-06-11T20:10:30.781Z&end=2024-06-30T20:11:00.781Z&step=3h&format=prometheus' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2668    0  2668    0     0  3549k      0 --:--:-- --:--:-- --:--:-- 2605k
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "+Inf"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "0"
        },
        "values": [
          [
            1719216630.781,
            "0"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "0.001"
        },
        "values": [
          [
            1719216630.781,
            "4"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "0.01"
        },
        "values": [
          [
            1719216630.781,
            "4"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "0.05"
        },
        "values": [
          [
            1719216630.781,
            "4"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "0.1"
        },
        "values": [
          [
            1719216630.781,
            "4"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "0.5"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "1"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "10"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "15"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "2"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "20"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "25"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "3"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "30"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "5"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "my_tracing_duration_seconds_bucket",
          "instance": "localhost:9100",
          "job": "node_exporter",
          "le": "7"
        },
        "values": [
          [
            1719216630.781,
            "5"
          ]
        ]
      }
    ]
  }
}

edsiper commented 3 months ago

We cannot reproduce this; changing the milestone.

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been stalled for 5 days with no activity.