elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Missing count and sum in prometheus histogram metrics #41573

Open henrikno opened 4 days ago

henrikno commented 4 days ago

This prometheus histogram metric:

cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.005"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.025"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.1"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.25"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.5"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="1"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="2"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="4"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="8"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="15"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="30"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="60"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="+Inf"} 23
cilium_k8s_client_rate_limiter_duration_seconds_sum{method="GET",path="/version"} 3.2705999999999995e-05
cilium_k8s_client_rate_limiter_duration_seconds_count{method="GET",path="/version"} 23

Gets turned into this in ES:

    "prometheus": {
      "cilium_k8s_client_api_latency_time_seconds": {
        "histogram": {
          "values": [],
          "counts": []
        }
      },
      "cilium_k8s_client_rate_limiter_duration_seconds": {
        "histogram": {
          "values": [],
          "counts": []
        }
      },
      "labels": {
        "instance": "10.21.98.177:9962",
        "job": "prometheus",
        "method": "GET",
        "path": "/version"
      },
      "labels_fingerprint": "lrtsgTghb5LOY7ViR50IWXf7y6M="
    },

We are using use_types and rate_counters, (1) because it's the default in the elastic-agent integration, and (2) to be able to query the metrics in Kibana. https://www.elastic.co/docs/current/integrations/prometheus#histograms-and-types-1 However, the values look neither like the example there nor like what we expect: the _count and _sum are missing. I was hoping to query the rate of the count/sum.
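To illustrate why _sum and _count matter, here is a minimal Python sketch of the calculation they allow between two scrapes. The first sample uses the values from the endpoint output above; the second sample and the 20-second interval are made-up numbers for illustration only, and `HistogramTotals` and `rate_and_mean` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HistogramTotals:
    """The _sum and _count series of one Prometheus histogram at one scrape."""
    total_sum: float   # value of <name>_sum, in seconds
    total_count: int   # value of <name>_count


def rate_and_mean(prev, curr, interval_s):
    """Return (observations per second, mean observed value) between two scrapes.

    Counter resets are not handled in this sketch.
    """
    d_count = curr.total_count - prev.total_count
    d_sum = curr.total_sum - prev.total_sum
    rate = d_count / interval_s
    mean = d_sum / d_count if d_count else 0.0
    return rate, mean


# First scrape: values taken from the endpoint output in this issue.
first = HistogramTotals(total_sum=3.2705999999999995e-05, total_count=23)
# Second scrape: HYPOTHETICAL values, assumed to be taken 20 s later.
second = HistogramTotals(total_sum=4.2705999999999995e-05, total_count=33)

rate, mean = rate_and_mean(first, second, interval_s=20)
print(f"{rate:.2f} obs/s, mean {mean:.2e} s")
```

Neither number can be derived from the bucket counts alone, since the buckets only bound each observation.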

Side note: if the buckets are diffed, I expected the field to be named .rate, which is what it does for counters. I was confused about why the values did not match what the prometheus endpoint was reporting at all. Also, for counters it keeps the original value, but I can see that doing so here would increase the storage.

Another thing that looks funny is the empty values {"values":[],"counts":[]}. Looks like this is happening because we use TSDS, which is also the default in the prometheus integration.

I tried to reproduce it with just metricbeat 8.15.3.

Put the example at the top in a file called metrics, then run python3 -m http.server 9000
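As an alternative to the file plus `python3 -m http.server`, the same endpoint can be sketched in a few lines of Python. This is only a convenience for reproducing the issue: `start_metrics_server` is a hypothetical helper, the server binds to 127.0.0.1, and only two of the series are repeated here, so paste the full example in place of `METRICS` for a faithful repro.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Shortened copy of the exposition text from the top of this issue.
METRICS = (
    'cilium_k8s_client_rate_limiter_duration_seconds_sum'
    '{method="GET",path="/version"} 3.2705999999999995e-05\n'
    'cilium_k8s_client_rate_limiter_duration_seconds_count'
    '{method="GET",path="/version"} 23\n'
)


def start_metrics_server(body, port=9000):
    """Serve `body` at /metrics from a background thread and return the server."""
    payload = body.encode("utf-8")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, fmt, *args):
            pass  # keep the console quiet

    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_metrics_server(METRICS)` and then blocking the main thread, for example with `threading.Event().wait()`, keeps the endpoint up on port 9000 while metricbeat scrapes it.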

cat modules.d/prometheus.yml
- module: prometheus
  period: 20s
  hosts: ["localhost:9000"]
  metrics_path: /metrics
  use_types: true
  rate_counters: false

Set this in metricbeat.yml

output.console:
  pretty: true

With use_types: true, rate_counters: false, I get a bunch of zeroes:

   "cilium_k8s_client_rate_limiter_duration_seconds": {
      "histogram": {
        "values": [
          0.0025,
          0.015,
          0.0625,
          0.175,
          0.375,
          0.75,
          1.5,
          3,
          6,
          11.5,
          22.5,
          45,
          60
        ],
        "counts": [
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0
        ]
      }

With rate_counters: false I expected to get the exact same values that prometheus reports. The _sum and _count are also missing. If I disable use_types, I do see the values, but there's one document per bucket, which is extremely difficult to query.
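The values array in the output above matches the midpoints of adjacent le bounds, with the +Inf bucket pinned to the last finite bound (60). The Python sketch below reproduces that layout from the endpoint output. It is inferred from the numbers in this issue and is not taken from Metricbeat's code; `to_es_histogram` is a hypothetical helper name.

```python
import math

# Cumulative bucket counts from the endpoint output: (le upper bound, count).
BUCKETS = [
    (0.005, 23), (0.025, 23), (0.1, 23), (0.25, 23), (0.5, 23),
    (1, 23), (2, 23), (4, 23), (8, 23), (15, 23), (30, 23), (60, 23),
    (math.inf, 23),
]


def to_es_histogram(buckets):
    """Turn cumulative Prometheus buckets into {"values": [...], "counts": [...]}.

    Assumption: each value is the midpoint of its bucket, and the +Inf
    bucket reuses the last finite bound. Counts are de-accumulated so
    that each entry holds only the observations in that bucket.
    """
    values, counts = [], []
    prev_bound, prev_count = 0.0, 0
    for bound, cumulative in sorted(buckets):
        if math.isinf(bound):
            values.append(prev_bound)
        else:
            values.append(prev_bound + (bound - prev_bound) / 2)
            prev_bound = bound
        counts.append(cumulative - prev_count)
        prev_count = cumulative
    return {"values": values, "counts": counts}


histogram = to_es_histogram(BUCKETS)
print(histogram["counts"])
```

Under these assumptions a single scrape yields 23 in the first bucket and 0 in every other bucket, not the all-zero counts shown above.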

jlind23 commented 1 day ago

Thanks @henrikno for creating this. Let me pull @lalit-satapathy in as his team is the one owning the prometheus integration.

lalit-satapathy commented 1 day ago

@shmsr can you take a look?

shmsr commented 1 day ago

I'll take a look.