DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] Prometheus Histogram handling seems to be broken #17001

Open datsabk opened 1 year ago

datsabk commented 1 year ago

Agent Environment: v7.39.2-jmx

Describe what happened: I am generating sample prometheus metrics for my app as shown below:

    # TYPE promtest_metric_ops_total counter
    promtest_metric_ops_total 334
    # HELP promtest_metric_request_duration_seconds Duration of the request.
    # TYPE promtest_metric_request_duration_seconds histogram
    promtest_metric_request_duration_seconds_bucket{le="0.1"} 56
    promtest_metric_request_duration_seconds_bucket{le="5"} 140
    promtest_metric_request_duration_seconds_bucket{le="10"} 252
    promtest_metric_request_duration_seconds_bucket{le="+Inf"} 334
    promtest_metric_request_duration_seconds_sum 1852.8399999999992
    promtest_metric_request_duration_seconds_count 334

However, the Datadog Agent doesn't seem to generate the expected percentile, aggregate, or distribution metrics.
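
For readers comparing the sample against what shows up in Datadog, note that Prometheus histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound. The following is a minimal sketch, not Agent code, that turns the sample's cumulative counts into per-bucket counts and computes the mean from `_sum` and `_count`. The function name `per_bucket_counts` is hypothetical and used only for illustration.

```python
# Cumulative (le, count) pairs copied from the sample exposition in this issue.
cumulative = [("0.1", 56), ("5", 140), ("10", 252), ("+Inf", 334)]
total_sum = 1852.8399999999992   # promtest_metric_request_duration_seconds_sum
total_count = 334                # promtest_metric_request_duration_seconds_count


def per_bucket_counts(buckets):
    """Convert cumulative (le, count) pairs into per-bucket counts."""
    result = []
    previous = 0
    for upper_bound, count in buckets:
        # Observations in this bucket only: subtract everything counted so far.
        result.append((upper_bound, count - previous))
        previous = count
    return result


counts = per_bucket_counts(cumulative)
mean = total_sum / total_count

print(counts)
print(round(mean, 3))
```

The per-bucket counts add up to the `_count` value, which is a quick sanity check that the exposition itself is well formed before looking at the Agent side.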

Describe what you expected: The Agent generates aggregate and percentile metrics from the histogram.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

Datadog YAML:

    dogstatsd_buffer_size: 65535
    dogstatsd_so_rcvbuf: 16777216
    histogram_aggregates:
    - max
    - median
    - avg
    - count
    - sum
    histogram_copy_to_distribution: true
    histogram_percentiles:
    - "0.95"
    - "0.90"
    - "0.50"
    jmx_use_container_support: true
    tags:
    - environment:stage
    - env:stage
    - region:us-west-2

goodspark commented 1 year ago

I'm seeing this too.

DD agent: v7.39.2. The Agent is running as a DaemonSet on a k8s cluster with the v3.1.3 DD Helm chart.

Relevant Helm chart values:

    datadog:
      prometheusScrape:
        enabled: true
        additionalConfigs:
          - configurations:
            - send_distribution_buckets: true

k8s deployment with autodiscovery annotations:

    metadata:
      annotations:
        ad.datadoghq.com/myapp.check_names: '["openmetrics"]'
        ad.datadoghq.com/myapp.init_configs: '[{}]'
        ad.datadoghq.com/myapp.instances: |
          [
            {
              "openmetrics_endpoint": "http://%%host%%:2112/",
              "namespace": "myapp",
              "metrics": [
                "myapp*"
              ]
            }
          ]

FWIW, I am seeing Prometheus metrics go through. It's just that histograms aren't being converted into distributions, like datsabk said.

I followed this guide: https://docs.datadoghq.com/agent/kubernetes/prometheus/

chadxzs commented 1 year ago

@datsabk @goodspark did you all find a workaround? We are thinking maybe this is a regression and are about to try downgrading and/or upgrading our agent, because we are also on the same version (v7.39.2).

goodspark commented 1 year ago

Yeah. It turned out to be poor documentation from Datadog. I'm not at a computer right now, but once I'm back I can share my notes.

datsabk commented 1 year ago

@goodspark Looking forward to more details on what you found. I haven't been able to solve this yet. I did run into other instances of poor documentation, certainly.

goodspark commented 1 year ago

So after some back and forth with DD support, it turned out we also needed to add an option to the instance config in the pod annotation:

    metadata:
      annotations:
        ad.datadoghq.com/myapp.check_names: '["openmetrics"]'
        ad.datadoghq.com/myapp.init_configs: '[{}]'
        ad.datadoghq.com/myapp.instances: |
          [
            {
              "openmetrics_endpoint": "http://%%host%%:2112/",
              "namespace": "myapp",
              "metrics": [
                "myapp*"
-              ]
+              ],
+              "histogram_buckets_as_distributions": true
            }
          ]

chadxzs commented 1 year ago

Hm, ok, interesting. I'll add that. FWIW, this is what I have now that isn't working:

      ad.datadoghq.com/apollo-router.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:9090/metrics",
                  "send_distribution_buckets": true,
                  "metrics": [
                    "apollo_router_processing_time.*"
                  ]
                }
              ]
            }
          }

chadxzs commented 1 year ago

With your setting, @goodspark, it works :)

        ad.datadoghq.com/apollo-router.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:9090/metrics",
-                  "send_distribution_buckets": true,
+                  "histogram_buckets_as_distributions": true,
                  "namespace": "bingo",
                  "metrics": [
                    "apollo_router_processing_time.*"
                  ]
                }
              ]
            }
          }
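
To catch this before deploying, the two annotation values in this thread can be compared with a small script. This is a rough sketch under one assumption taken from this thread only: that `send_distribution_buckets` has no effect on an instance configured through `openmetrics_endpoint`, and that `histogram_buckets_as_distributions` is the option that does. The helper names `lint_instance` and `lint_checks_annotation` are hypothetical and are not part of the Agent or any Datadog tooling.

```python
import json

# Assumption drawn from this thread: this option did nothing when the
# instance was configured through "openmetrics_endpoint".
SUSPECT_OPTIONS = {"send_distribution_buckets"}


def lint_instance(instance):
    """Return a list of warnings for one OpenMetrics instance dict."""
    warnings = []
    if "openmetrics_endpoint" not in instance:
        return warnings  # this sketch only covers openmetrics_endpoint instances
    for option in sorted(SUSPECT_OPTIONS & set(instance)):
        warnings.append(f"{option} had no effect in this thread")
    if instance.get("histogram_buckets_as_distributions") is not True:
        warnings.append("histogram_buckets_as_distributions is not true")
    return warnings


def lint_checks_annotation(raw_json):
    """Lint every instance in an ad.datadoghq.com/<name>.checks value."""
    checks = json.loads(raw_json)
    instances = checks.get("openmetrics", {}).get("instances", [])
    return [lint_instance(instance) for instance in instances]


# The annotation value reported as not working in this thread.
not_working = """
{"openmetrics": {"instances": [{
    "openmetrics_endpoint": "http://%%host%%:9090/metrics",
    "send_distribution_buckets": true,
    "metrics": ["apollo_router_processing_time.*"]
}]}}
"""

# The annotation value reported as working in this thread.
working = """
{"openmetrics": {"instances": [{
    "openmetrics_endpoint": "http://%%host%%:9090/metrics",
    "histogram_buckets_as_distributions": true,
    "namespace": "bingo",
    "metrics": ["apollo_router_processing_time.*"]
}]}}
"""

print(lint_checks_annotation(not_working))
print(lint_checks_annotation(working))
```

The first call reports two warnings for the single instance and the second reports none, which matches what was observed above.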

See also https://github.com/DataDog/integrations-core/issues/5883#issuecomment-962156502