DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

autodiscovery can't get metrics without wildcard or regex #11436

Open vl-shopback opened 2 years ago

vl-shopback commented 2 years ago

Output of the info page (if this is a bug)

2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | - | (connectionpool.py:456) | http://10.244.1.57:9898 "GET /metrics HTTP/1.1" 200 1769
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `http_request_duration_seconds` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `http_requests` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `process_cpu_seconds` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `process_open_fds` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `process_resident_memory_bytes` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `process_start_time_seconds` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/python/datadog_agent.go:128 in LogMessage) | openmetrics:k38s:673e649a47d0b63c | (transform.py:80) | Skipping metric `promhttp_metric_handler_requests` as it is not defined in `metrics`
2022-03-24 10:09:02 UTC | CORE | DEBUG | (pkg/collector/worker/check_logger.go:58 in CheckFinished) | check:openmetrics | Done running check

Describe what happened: the check can't collect the metric `process_cpu_seconds_total`; the other metrics are collected fine.

Describe what you expected: the Agent should send all matched metrics to Datadog.

Steps to reproduce the issue:

  ad.datadoghq.com/podinfo.check_names: '["openmetrics"]'
  ad.datadoghq.com/podinfo.init_configs: '[{}]'
  ad.datadoghq.com/podinfo.instances: |
    [
      {
        "openmetrics_endpoint": "http://%%host%%:9898/metrics",
        "namespace": "k38s",
        "metrics": [{"promhttp_metric_handler_requests_in_flight":"podinfo_requests_in_flight"}, "process_virtual_.*", "process_cpu_seconds_total", "^process_(max|openx)_fds", "go_.*"],
        "exclude_metrics":
          [
            "go_gc_.*",
            "^go_memstats_.*_bytes$"
          ]
      }
    ]

Additional environment details (Operating System, Cloud provider, etc):

~ $ curl -s http://localhost:9898/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.2478e-05
go_gc_duration_seconds{quantile="0.25"} 6.4613e-05
go_gc_duration_seconds{quantile="0.5"} 7.9964e-05
go_gc_duration_seconds{quantile="0.75"} 9.0612e-05
go_gc_duration_seconds{quantile="1"} 0.00015215
go_gc_duration_seconds_sum 0.000935805
go_gc_duration_seconds_count 12
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 13
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.17.8"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.5787904e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 6.8325672e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.461488e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 407026
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 3.930378458628937e-06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 5.252704e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.5787904e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.588096e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.67936e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 26973
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 524288
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 2.0381696e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6481194711950855e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 433999
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 2400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 78200
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 114688
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 2.5241584e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 577720
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 589824
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 589824
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.8394504e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 9
# HELP http_request_duration_seconds Seconds spent serving HTTP requests.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.005"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.01"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.025"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.05"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.1"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.25"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="0.5"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="1"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="2.5"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="5"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="10"} 127
http_request_duration_seconds_bucket{method="GET",path="healthz",status="200",le="+Inf"} 127
http_request_duration_seconds_sum{method="GET",path="healthz",status="200"} 0.009265723000000002
http_request_duration_seconds_count{method="GET",path="healthz",status="200"} 127
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.005"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.01"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.025"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.05"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.1"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.25"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="0.5"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="1"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="2.5"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="5"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="10"} 425
http_request_duration_seconds_bucket{method="GET",path="metrics",status="200",le="+Inf"} 425
http_request_duration_seconds_sum{method="GET",path="metrics",status="200"} 0.43235149800000006
http_request_duration_seconds_count{method="GET",path="metrics",status="200"} 425
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.005"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.01"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.025"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.05"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.1"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.25"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="0.5"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="1"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="2.5"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="5"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="10"} 129
http_request_duration_seconds_bucket{method="GET",path="readyz",status="200",le="+Inf"} 129
http_request_duration_seconds_sum{method="GET",path="readyz",status="200"} 0.008190198000000003
http_request_duration_seconds_count{method="GET",path="readyz",status="200"} 129
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{status="200"} 681
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.84
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.591616e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.64811826012e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.4805248e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 425
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
randallt commented 2 years ago

I am seeing a similar issue. It only seems to affect counter metric types. We're on dd agent v7.32.4.

randallt commented 2 years ago

It looks like this is a bug in the V2 OpenMetrics check. When we switched back to the V1 OpenMetrics check by changing `openmetrics_endpoint` to `prometheus_url`, it worked as expected.
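
For reference, switching the original report's annotation back to the V1 check might look like this (a sketch based on the config in the report; note the V1 check's `metrics` entries use `*` wildcards rather than full regular expressions, so the patterns need adjusting):

```yaml
ad.datadoghq.com/podinfo.check_names: '["openmetrics"]'
ad.datadoghq.com/podinfo.init_configs: '[{}]'
ad.datadoghq.com/podinfo.instances: |
  [
    {
      "prometheus_url": "http://%%host%%:9898/metrics",
      "namespace": "k38s",
      "metrics": ["process_cpu_seconds_total", "process_virtual_*", "go_*"]
    }
  ]
```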

yzhan289 commented 2 years ago

@vl-shopback @randallt When specifying the metric, can you also include the type `counter`? By default, metrics are submitted as `gauge`, but counters (as well as histograms and summaries) need their type specified.
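
If I read the suggestion right, the per-metric mapping form of the V2 `metrics` option is what carries the type, roughly like this (a sketch; it assumes the V2 schema accepts `name`/`type` keys in the per-metric mapping, as in the check's example config, which is worth verifying):

```yaml
"metrics": [
  {"process_cpu_seconds": {"name": "process_cpu_seconds", "type": "counter"}}
]
```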

randallt commented 2 years ago

@yzhan289 Why would the Datadog Agent sometimes correctly detect the type (I assume from the `# TYPE` lines in the `/metrics` output) and sometimes not? I have lots of counters and histograms submitted correctly without specifying their type in the annotation YAML.

yzhan289 commented 2 years ago

> I have lots of counters and histograms correctly submitted without specifying their type in the annotation yaml.

@randallt Are these correctly submitted on V1 or V2?

BEvgeniyS commented 2 years ago

I hit the same issue today, reverting to v1 helped. Thanks @randallt !

jbasement commented 2 years ago

I had the same issue, but it turned out I had just hit the default limit of 2000 metrics. I found this out by running `agent check openmetrics` on the Agent hosted on the same node as the deployment/service providing the metrics. The limit of 2000 metrics is also documented here.
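
In case it helps others hitting the same limit, it can be raised per check instance (a sketch; `max_returned_metrics` is the instance option I'd expect to control this, worth confirming against the check's example config):

```yaml
"instances": [
  {
    "openmetrics_endpoint": "http://%%host%%:9898/metrics",
    "namespace": "k38s",
    "max_returned_metrics": 5000
  }
]
```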

varkey commented 1 year ago

Noticed this in the docs. Since the metric name ends with `_total`, you'd need to specify the name without the `_total` suffix:

    ## Note: To collect counter metrics with names ending in `_total`, specify the metric name without the `_total`
    ## suffix. For example, to collect the counter metric `promhttp_metric_handler_requests_total`, specify
    ## `promhttp_metric_handler_requests`. This submits to Datadog the metric name appended with `.count`.
    ## For more information, see:
    ## https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#suffixes
    ##
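
Applied to the original report, that means dropping the `_total` suffix from the counter entries, e.g. (a sketch of the adjusted `metrics` list; per the docs above, matched counters are then submitted with a `.count` suffix):

```yaml
"metrics": [
  {"promhttp_metric_handler_requests_in_flight": "podinfo_requests_in_flight"},
  "process_virtual_.*",
  "process_cpu_seconds",
  "go_.*"
]
```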
NicolasTr commented 1 year ago

+1, it works when removing `_total` from the metric name in the Kubernetes annotation.

I think this information should be added to this page, in the table mentioning <METRIC_TO_FETCH>.

ry0suke17 commented 11 months ago

> "^process_(max|openx)_fds"

On another note, can you collect the metric above? I tried, but I can't get a match when using `^`.

Is there some regex syntax that can't be used?
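
For local debugging, one way to see how the patterns interact with the `_total` stripping is to test them against the exposed metric names with Python's `re` module (a sketch; that the check uses `re.search` semantics rather than anchored full matches is an assumption here, worth verifying against the check's source):

```python
import re

# Metric names as exposed by the /metrics endpoint above.
exposed = ["process_max_fds", "process_open_fds", "process_cpu_seconds_total"]

# Mimic the documented counter handling: a trailing `_total` is stripped
# before names are matched against the `metrics` list.
stripped = [name[: -len("_total")] if name.endswith("_total") else name
            for name in exposed]

for pattern in ["^process_(max|openx)_fds", "process_cpu_seconds"]:
    matches = [name for name in stripped if re.search(pattern, name)]
    print(pattern, "->", matches)

# Note: `(max|openx)` only matches `process_max_fds` here; the `openx`
# branch never matches `process_open_fds` because of the extra `x`.
```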