DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

Consul NodeName (NodeId) in Service Checks #12609

Closed hjkatz closed 2 years ago

hjkatz commented 2 years ago

Note: If you have a feature request, you should contact support so the request can be properly tracked.

Output of the info page

Getting the status from the agent.

===============
Agent (v7.37.1)
===============

  Status date: 2022-07-27 15:16:34.997 UTC (1658934994997)
  Agent start: 2022-07-22 18:34:00.519 UTC (1658514840519)
  Pid: 18465
  Go Version: go1.17.11
  Python Version: 3.8.11
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 385µs
    System time: 2022-07-27 15:16:34.997 UTC (1658934994997)

  Host Info
  =========
    bootTime: 2021-09-23 20:11:50 UTC (1632427910000)
    hostId: <redacted>
    kernelArch: x86_64
    kernelVersion: 4.9.0-16-amd64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.13
    procs: 121
    uptime: 7246h22m12s

  Hostnames
  =========
  <redacted>

  Metadata
  ========
    agent_version: 7.37.1
    cloud_provider: AWS
    config_apm_dd_url: 
    config_dd_url: https://app.datadoghq.com
    config_logs_dd_url: 
    config_logs_socks5_proxy_address: 
    config_no_proxy: []
    config_process_dd_url: 
    config_proxy_http: 
    config_proxy_https: 
    config_site: 
    feature_apm_enabled: true
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_logs_enabled: false
    feature_networks_enabled: false
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: false
    feature_processes_container_enabled: true
    flavor: agent
    hostname_source: os
    install_method_installer_version: deb_package
    install_method_tool: dpkg
    install_method_tool_version: dpkg-1.18.26

=========
Collector
=========

  Running Checks
  ==============

    consul (2.1.0)
    --------------
      Instance ID: consul:default:b77b05cc5a5351d9 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/consul.d/conf.yaml
      Total Runs: 28,011
      Metric Samples: Last Run: 1, Total: 28,011
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 2, Total: 57,359
      Average Execution Time : 4ms
      Last Execution Date : 2022-07-27 15:16:32 UTC (1658934992000)
      Last Successful Execution Date : 2022-07-27 15:16:32 UTC (1658934992000)
      metadata:
        version.major: 1
        version.minor: 8
        version.patch: 4
        version.raw: 1.8.4
        version.scheme: semver

  <redacted>

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    ClusterRole: 0
    ClusterRoleBinding: 0
    CronJob: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    HighPriorityQueueFull: 0
    Ingress: 0
    Job: 0
    Node: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 0
    RoleBinding: 0
    Service: 0
    ServiceAccount: 0
    StatefulSet: 0

  Transaction Successes
  =====================
    Total number: 59055
    Successes By Endpoint:
      check_run_v1: 28,010
      intake: 2,335
      metadata_v1: 700
      series_v1: 28,010

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

  API Keys status
  ===============
    API key ending with 91d2c: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 91d2c

==========
Logs Agent
==========

  Logs Agent is not running

=============
Process Agent
=============

  Version: 7.37.1
  Status date: 2022-07-27 15:16:44.135 UTC (1658935004135)
  Process Agent Start: 2022-07-22 18:34:00.573 UTC (1658514840573)
  Pid: 18466
  Go Version: go1.17.11
  Build arch: amd64
  Log Level: info
  Enabled Checks: [process_discovery]
  Allocated Memory: 13,024,816 bytes
  Hostname: <redacted> # consul-server-host (leader)

  =================
  Process Endpoints
  =================
    https://process.datadoghq.com - API Key ending with:
        - 91d2c

  =========
  Collector
  =========
    Last collection time: 2022-07-27 14:34:01
    Docker socket: 
    Number of processes: 0
    Number of containers: 0
    Process Queue length: 0
    RTProcess Queue length: 0
    Pod Queue length: 0
    Process Bytes enqueued: 0
    RTProcess Bytes enqueued: 0
    Pod Bytes enqueued: 0
    Drop Check Payloads: []
=========
APM Agent
=========
  <redacted>

=========
Aggregator
=========
  Checks Metric Sample: 6,678,903
  Dogstatsd Metric Sample: 4,565,629
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 28,010
  Series Flushed: 8,834,639
  Service Check: 310,384
  Service Checks Flushed: 338,390
=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 4,565,628
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 406,735,605
  Udp Packet Reading Errors: 0
  Udp Packets: 2,650,131
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
  Unterminated Metric Errors: 0

====
OTLP
====

  Status: Not enabled
  Collector status: Not running

Additional environment details (Operating System, Cloud provider, etc):

Steps to reproduce the issue:

  1. Install the consul.d check/integration
  2. ????
  3. Non-profit

Describe the results you received:

The Datadog Service Checks emitted for Consul Service health checks do not include a node, node_name, or node_id tag, or any other information identifying the node.

Describe the results you expected:

A node, node_name, or node_id tag (or equivalent information) should exist on the Datadog Service Check, since the information is available and is retrieved from the Consul API.

Additional information you deem important (e.g. issue happens only occasionally):

The problem is as follows. A Consul Service Check reports a status (Ok, Warning, or Critical) together with the Service, the Check (ID), and the Node, i.e. the host the check is failing for. However, the Datadog Consul integration does not seem to gather the Node Name/ID. So when a Datadog Consul Service Check (like consul.check) is in the Critical state, the information provided only identifies the Consul Service and the Check Name/ID. That is not particularly useful for a Consul Service with 50 Nodes: which Node has the failing check?

The tag should be added here: https://github.com/DataDog/integrations-core/blob/f8c50c779dc836e9419326a5d2d64524f3216821/consul/datadog_checks/consul/consul.py#L367-L375

Specifically on/after line 373:

if check["Node"]:
    tags.append("consul_node:{}".format(check["Node"]))
sc[sc_id] = {'status': status, 'tags': tags}
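
For illustration, here is a minimal, self-contained sketch of the proposed tagging logic. It is not the integration's actual code: the function name `build_service_check_tags` is hypothetical, and the `check`, `consul_service`, and `consul_service_id` tag names are assumptions used for context. Only the `consul_node` tag comes from the proposal above.

```python
def build_service_check_tags(check):
    """Build tags for one entry of the /v1/health/state/any response.

    Hypothetical helper; tag names other than consul_node are assumptions.
    """
    tags = ["check:{}".format(check["CheckID"])]
    # Service-level checks carry a service name and ID; node-level checks
    # (such as serfHealth) leave both fields empty.
    if check.get("ServiceName"):
        tags.append("consul_service:{}".format(check["ServiceName"]))
    if check.get("ServiceID"):
        tags.append("consul_service_id:{}".format(check["ServiceID"]))
    # Proposed addition: record which node reported the check.
    if check.get("Node"):
        tags.append("consul_node:{}".format(check["Node"]))
    return tags


example = {
    "Node": "foobar",
    "CheckID": "serfHealth",
    "Status": "passing",
    "ServiceID": "",
    "ServiceName": "",
}
print(build_service_check_tags(example))  # ['check:serfHealth', 'consul_node:foobar']
```

Using `check.get("Node")` rather than `check["Node"]` keeps the sketch from raising a `KeyError` if the field is ever absent.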

The data is already available: it is returned by the Consul API endpoint /v1/health/state/any, which is queried on line 356: https://github.com/DataDog/integrations-core/blob/f8c50c779dc836e9419326a5d2d64524f3216821/consul/datadog_checks/consul/consul.py#L356

See: https://www.consul.io/api-docs/health#sample-response-3

Example Response:


[
  {
    "Node": "foobar",
    "CheckID": "serfHealth",
    "Name": "Serf Health Status",
    "Status": "passing",
    "Notes": "",
    "Output": "",
    "ServiceID": "",
    "ServiceName": "",
    "ServiceTags": [],
    "Namespace": "default"
  },
  [...]
]
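
To show why the Node field matters, here is a small sketch that groups non-passing checks by node from a parsed /v1/health/state/any response. The function name `failing_checks_by_node` is hypothetical, and the `node-a`, `node-b`, and `web` entries are made-up sample data, not output from a real cluster.

```python
from collections import defaultdict


def failing_checks_by_node(health_state):
    """Map each node name to the IDs of its checks that are not passing."""
    failing = defaultdict(list)
    for check in health_state:
        if check.get("Status") != "passing":
            failing[check.get("Node", "")].append(check["CheckID"])
    return dict(failing)


# Made-up sample data in the shape of the response above.
sample = [
    {"Node": "foobar", "CheckID": "serfHealth", "Status": "passing", "ServiceName": ""},
    {"Node": "node-a", "CheckID": "service:web", "Status": "critical", "ServiceName": "web"},
    {"Node": "node-b", "CheckID": "service:web", "Status": "passing", "ServiceName": "web"},
]
print(failing_checks_by_node(sample))  # {'node-a': ['service:web']}
```

Without the node name, the two `service:web` entries cannot be told apart, which is the ambiguity described in this issue.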

hjkatz commented 2 years ago

Note: I was unable to open a Datadog Support ticket for this issue/feature request because the provided link and support center did not have an option available for such tickets/questions/requests.

If you can point me in the right direction, I'll be happy to open a ticket.

Alternatively, if this feature seems low-hanging enough, I am also happy to submit a PR to add this tag information (even behind a flag/option if desired).

Additionally, this Node Name information is available in the Telegraf consul plugin. That plugin is not ideal, though, because metrics collected from Consul this way and submitted to the Datadog API are considered custom metrics (and are thus billed differently).

FlorentClarret commented 2 years ago

Hi @hjkatz, thanks for opening this issue and the great description!

I created a card in our backlog to work on this. However, we would also be happy to review your PR if you want to take care of this.

hjkatz commented 2 years ago

@FlorentClarret Thanks for responding, here's the PR: https://github.com/DataDog/integrations-core/pull/12675