apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

All Prometheus histogram buckets are malformed #13869

Open ofek opened 2 years ago

ofek commented 2 years ago

Describe the bug

As documented in the official spec and mentioned in Pulsar's docs, histogram buckets are exposed as series suffixed with _bucket that carry the upper bound in an le label.

Instead, the label name and its value are embedded in the metric name as a suffix, and each series is typed as a gauge:

# TYPE pulsar_storage_write_latency_le_0_5 gauge
pulsar_storage_write_latency_le_0_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_1 gauge
pulsar_storage_write_latency_le_1{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_5 gauge
pulsar_storage_write_latency_le_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_10 gauge
pulsar_storage_write_latency_le_10{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_20 gauge
pulsar_storage_write_latency_le_20{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_50 gauge
pulsar_storage_write_latency_le_50{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_100 gauge
pulsar_storage_write_latency_le_100{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_200 gauge
pulsar_storage_write_latency_le_200{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_le_1000 gauge
pulsar_storage_write_latency_le_1000{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_overflow gauge
pulsar_storage_write_latency_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_count gauge
pulsar_storage_write_latency_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_write_latency_sum gauge
pulsar_storage_write_latency_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_0_5 gauge
pulsar_storage_ledger_write_latency_le_0_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_1 gauge
pulsar_storage_ledger_write_latency_le_1{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_5 gauge
pulsar_storage_ledger_write_latency_le_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_10 gauge
pulsar_storage_ledger_write_latency_le_10{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_20 gauge
pulsar_storage_ledger_write_latency_le_20{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_50 gauge
pulsar_storage_ledger_write_latency_le_50{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_100 gauge
pulsar_storage_ledger_write_latency_le_100{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_200 gauge
pulsar_storage_ledger_write_latency_le_200{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_le_1000 gauge
pulsar_storage_ledger_write_latency_le_1000{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_overflow gauge
pulsar_storage_ledger_write_latency_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_count gauge
pulsar_storage_ledger_write_latency_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_storage_ledger_write_latency_sum gauge
pulsar_storage_ledger_write_latency_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_128 gauge
pulsar_entry_size_le_128{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_512 gauge
pulsar_entry_size_le_512{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_1_kb gauge
pulsar_entry_size_le_1_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_2_kb gauge
pulsar_entry_size_le_2_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_4_kb gauge
pulsar_entry_size_le_4_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_16_kb gauge
pulsar_entry_size_le_16_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_100_kb gauge
pulsar_entry_size_le_100_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_1_mb gauge
pulsar_entry_size_le_1_mb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_le_overflow gauge
pulsar_entry_size_le_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_count gauge
pulsar_entry_size_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
# TYPE pulsar_entry_size_sum gauge
pulsar_entry_size_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata",partition="-1"} 0.0 1642722619078
pulsar_storage_write_latency_le_0_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_1{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_10{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_20{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_50{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_100{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_200{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_1000{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_0_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_1{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_10{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_20{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_50{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_100{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_200{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_1000{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_128{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_512{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_1_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_2_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_4_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_16_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_100_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_1_mb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_entry_size_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_0_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_1{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_10{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_20{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_50{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_100{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_200{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_le_1000{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_write_latency_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_0_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_1{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_5{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_10{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_20{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_50{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_100{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_200{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_le_1000{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_storage_ledger_write_latency_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_128{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_512{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_1_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_2_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_4_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_16_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_100_kb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_1_mb{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_le_overflow{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
pulsar_entry_size_sum{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments",partition="-1"} 0.0 1642722619079
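For comparison, the exposition format in the spec would carry each bound in an le label on a single _bucket series. The Python sketch below rewrites the names from the dump above into that shape. The helper name to_spec_series and the mapping of the overflow series to le="+Inf" are assumptions made for illustration, not Pulsar code. It only handles the numeric suffixes (such as _le_0_5) and the overflow series, leaves the _kb/_mb entry-size names untouched, and does not make the values cumulative.

```python
import re

# Names such as pulsar_storage_write_latency_le_0_5 or ..._le_1000.
# The decimal point of the bound is encoded as "_" in the metric name.
_BOUND = re.compile(r"^(?P<base>.+)_le_(?P<bound>\d+(?:_\d+)?)$")

# Names such as ..._latency_overflow or pulsar_entry_size_le_overflow.
_OVERFLOW = re.compile(r"^(?P<base>.+?)(?:_le)?_overflow$")


def to_spec_series(name: str) -> str:
    """Rewrite an embedded-bound metric name as <base>_bucket{le="<bound>"}.

    Hypothetical helper: the overflow series is assumed to correspond to
    the +Inf bucket; any other name is returned unchanged.
    """
    match = _BOUND.match(name)
    if match:
        bound = match.group("bound").replace("_", ".")
        return f'{match.group("base")}_bucket{{le="{bound}"}}'
    match = _OVERFLOW.match(name)
    if match:
        return f'{match.group("base")}_bucket{{le="+Inf"}}'
    return name


if __name__ == "__main__":
    for sample in (
        "pulsar_storage_write_latency_le_0_5",
        "pulsar_storage_ledger_write_latency_le_1000",
        "pulsar_entry_size_le_overflow",
    ):
        print(sample, "->", to_spec_series(sample))
```

In the spec-conforming form, all bounds of one histogram share a single metric family, which is what lets a consumer group them by the le label.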

To Reproduce

Steps to reproduce the behavior:

Start Pulsar with the Docker Compose file below, then run:

curl -L http://localhost:8080/metrics

Docker Compose file:

version: '3'

services:
  pulsar:
    container_name: pulsar
    image: apachepulsar/pulsar:2.9.1
    command:
    - bash
    - -c
    - >
      bin/apply-config-from-env-with-prefix.py BOOKKEEPER_ conf/bookkeeper.conf &&
      bin/apply-config-from-env-with-prefix.py BROKER_ conf/broker.conf &&
      bin/apply-config-from-env-with-prefix.py STANDALONE_ conf/standalone.conf &&
      exec bin/pulsar standalone
    ports:
    - '6650:6650'
    - '8080:8080'
    environment:
    - BOOKKEEPER_enableStatistics=true
    - BOOKKEEPER_prometheusStatsHttpPort=8080
    - BROKER_exposeTopicLevelMetricsInPrometheus=true
    - BROKER_exposeConsumerLevelMetricsInPrometheus=true
    - BROKER_exposeProducerLevelMetricsInPrometheus=true
    - BROKER_exposeManagedLedgerMetricsInPrometheus=true
    - BROKER_exposeManagedCursorMetricsInPrometheus=true
    - BROKER_exposePublisherStats=true
    - BROKER_exposePreciseBacklogInPrometheus=true
    - BROKER_splitTopicAndPartitionLabelInPrometheus=true
    - STANDALONE_exposeTopicLevelMetricsInPrometheus=true
    - STANDALONE_exposeConsumerLevelMetricsInPrometheus=true
    - STANDALONE_exposeProducerLevelMetricsInPrometheus=true
    - STANDALONE_exposeManagedLedgerMetricsInPrometheus=true
    - STANDALONE_exposeManagedCursorMetricsInPrometheus=true
    - STANDALONE_exposePublisherStats=true
    - STANDALONE_exposePreciseBacklogInPrometheus=true
    - STANDALONE_splitTopicAndPartitionLabelInPrometheus=true
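
To check a saved scrape for the problem without reading the whole dump, something like the following Python sketch could be used. It assumes the output of the curl command has been captured as a string. The function name embedded_bucket_metrics is hypothetical, and the pattern only looks for gauge TYPE lines whose metric name ends in an embedded bound or an overflow suffix.

```python
import re

# A "# TYPE <name> gauge" line whose name ends in an embedded bound
# (_le_<something>) or in an overflow suffix.
_TYPE_LINE = re.compile(r"^# TYPE (\S+(?:_le_\w+|_overflow)) gauge$")


def embedded_bucket_metrics(exposition: str) -> list[str]:
    """Return metric names, in order of appearance, that look like
    histogram buckets exported as separate gauges."""
    names = []
    for line in exposition.splitlines():
        match = _TYPE_LINE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = """\
# TYPE pulsar_storage_write_latency_le_0_5 gauge
# TYPE pulsar_storage_write_latency_overflow gauge
# TYPE pulsar_storage_write_latency_count gauge
"""
    for name in embedded_bucket_metrics(sample):
        print(name)
```

A non-empty result on a real scrape indicates that bucket bounds are still being encoded in metric names.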

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

ofek commented 2 years ago

Bump.

tjiuming commented 2 years ago

It's the doc's issue

ofek commented 2 years ago

No, this is broken

ofek commented 2 years ago

bump

asafm commented 2 years ago

Hi @ofek, you are absolutely correct. I've spent the last several months documenting the current state of metrics, and I released the document to the community just 2-3 weeks ago. As you can see there, it's a known issue.

This document is part of a large effort to refactor how the metrics are defined, used, and exported in Pulsar.

@codelipenghui @merlimat - we don't necessarily have to wait for the full refactor; we could provide a fix just for exporting histograms. It's not a small fix, but it's not a complicated one. The biggest issue is that doing so breaks compatibility, so it must be rolled out gradually behind flags (oldHistogram=true, newHistogram=false). WDYT?

asafm commented 2 years ago

@ofek I forgot to mention another issue you haven't raised: histogram bucket values today are delta-resets, meaning most of them are reset every configurable interval (30sec/1min). Prometheus's histogram_quantile function assumes the bucket values are cumulative counters that only increase. This also needs to be fixed, and fixing it breaks backward compatibility as well.
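
To illustrate the difference, here is a minimal Python sketch of turning per-interval delta counts into the cumulative bucket values that histogram_quantile expects. It is not Pulsar code. The class name CumulativeHistogram and its methods are hypothetical, and it assumes each delta counts only the observations that fell into that bucket's own range during one interval, which the thread does not state.

```python
from itertools import accumulate


class CumulativeHistogram:
    """Accumulates per-interval bucket deltas into Prometheus-style buckets.

    Assumption: deltas[i] is the number of observations that landed in
    bucket i alone during one interval, with a final slot for overflow.
    """

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One running total per finite bound, plus one for the +Inf bucket.
        self._totals = [0] * (len(self.bounds) + 1)

    def add_interval(self, deltas):
        """Fold one interval's counts into the running totals."""
        if len(deltas) != len(self._totals):
            raise ValueError("expected one delta per bound plus overflow")
        for index, delta in enumerate(deltas):
            self._totals[index] += delta

    def bucket_values(self):
        """Return (le, value) pairs, cumulative across bounds and over time."""
        labels = [str(bound) for bound in self.bounds] + ["+Inf"]
        return list(zip(labels, accumulate(self._totals)))


if __name__ == "__main__":
    histogram = CumulativeHistogram([0.5, 1, 5])
    histogram.add_interval([1, 2, 0, 1])  # first 30sec/1min window
    histogram.add_interval([0, 1, 1, 0])  # second window, counts were reset
    for le, value in histogram.bucket_values():
        print(f'le="{le}" {value}')
```

With values kept this way, each bucket only ever grows, and the +Inf bucket equals the total observation count.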

asafm commented 12 months ago

This will be solved as part of the PIP-264 implementation. The parent issue for tracking it is here