DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

AutoDiscovery in docker swarm mode doesn't work #4019

Open WTK opened 5 years ago

WTK commented 5 years ago

Output of the info page (if this is a bug)

Getting the status from the agent.

===============
Agent (v6.13.0)
===============

  Status date: 2019-08-15 07:51:24.322393 UTC
  Agent start: 2019-08-15 07:48:52.614063 UTC
  Pid: 334
  Go Version: go1.11.5
  Python Version: 2.7.16
  Check Runners: 4
  Log Level: DEBUG

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 8.318ms
    System UTC time: 2019-08-15 07:51:24.322393 UTC

  Host Info
  =========
    bootTime: 2019-08-12 05:00:30.000000 UTC
    kernelVersion: 4.15.0-55-generic
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 10.0
    procs: 69
    uptime: 74h48m44s
    virtualizationRole: guest
    virtualizationSystem: docker

  Hostnames
  =========
    hostname: wtk-desktop
    socket-fqdn: 43ac444f6df8
    socket-hostname: 43ac444f6df8
    host tags:
      docker_swarm_node_role:manager
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Total Runs: 9
      Metric Samples: Last Run: 6, Total: 48
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    disk (2.4.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Total Runs: 9
      Metric Samples: Last Run: 256, Total: 2,304
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 902ms

    docker
    ------
      Instance ID: docker [OK]
      Total Runs: 8
      Metric Samples: Last Run: 78, Total: 624
      Events: Last Run: 0, Total: 2
      Service Checks: Last Run: 1, Total: 8
      Average Execution Time : 351ms

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Total Runs: 9
      Metric Samples: Last Run: 5, Total: 45
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    io
    --
      Instance ID: io [OK]
      Total Runs: 8
      Metric Samples: Last Run: 572, Total: 4,180
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 187ms

    load
    ----
      Instance ID: load [OK]
      Total Runs: 9
      Metric Samples: Last Run: 6, Total: 54
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    memory
    ------
      Instance ID: memory [OK]
      Total Runs: 8
      Metric Samples: Last Run: 17, Total: 136
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 12ms

    network (1.11.0)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Total Runs: 9
      Metric Samples: Last Run: 68, Total: 612
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 79ms

    ntp
    ---
      Instance ID: ntp:b4579e02d1981c12 [OK]
      Total Runs: 8
      Metric Samples: Last Run: 1, Total: 8
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 8
      Average Execution Time : 45ms

    uptime
    ------
      Instance ID: uptime [OK]
      Total Runs: 9
      Metric Samples: Last Run: 1, Total: 9
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 9
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 4
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 22
    TimeseriesV1: 9

  API Keys status
  ===============
    API key ending with 703ef: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 703ef

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 8,177
  Dogstatsd Metric Sample: 24
  Event: 3
  Events Flushed: 3
  Number Of Flushes: 9
  Series Flushed: 5,622
  Service Check: 95
  Service Checks Flushed: 95

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 23
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 1,257
  Udp Packet Reading Errors: 0
  Udp Packets: 24
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

Describe what happened: After deploying a basic Docker swarm stack consisting of nginx and the Datadog agent, the agent fails to resolve the IP address of the nginx service and skips it. As far as I can tell this isn't specific to nginx, but rather to the code that produces this error: https://github.com/DataDog/datadog-agent/blob/fd70e65090d3f88c8208f127ab52a9cd383ce658/pkg/autodiscovery/configresolver/configresolver.go#L164

Relevant logs:

2019-08-15 07:49:12 UTC | CORE | DEBUG | (pkg/tagger/tagger.go:246 in Tag) | cache miss for docker, collecting tags for container_id://443947ccb1b185a42781fefb4e055e040e074c0775f9ff0b3824b421b57a548a
2019-08-15 07:49:12 UTC | CORE | DEBUG | (pkg/autodiscovery/configresolver/configresolver.go:136 in getHost) | Network "" not found, trying bridge IP instead
2019-08-15 07:49:12 UTC | CORE | WARN | (pkg/autodiscovery/autoconfig.go:529 in resolveTemplateForService) | error resolving template nginx for service docker://443947ccb1b185a42781fefb4e055e040e074c0775f9ff0b3824b421b57a548a: failed to resolve IP address for container docker://443947ccb1b185a42781fefb4e055e040e074c0775f9ff0b3824b421b57a548a, ignoring it. Source error: not able to determine which network is reachable
2019-08-15 07:49:12 UTC | CORE | DEBUG | (pkg/autodiscovery/autoconfig.go:280 in processNewConfig) | Can't resolve the template for nginx at this moment.

Describe what you expected: The agent should pick up the nginx service, resolve its IP address, and correctly process the Docker labels that were set up.

Steps to reproduce the issue: Using a basic docker-compose.yml file (a sketch of one is below), run docker stack deploy -c docker-compose.yml foo to start a Docker swarm stack. The agent won't correctly discover the nginx service. Do the same without swarm mode, using docker-compose instead, and everything works as expected.
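The original compose file isn't attached here; the following is only a minimal sketch of the kind of stack that reproduces this, assuming the usual nginx autodiscovery labels and agent mounts (image tags, the status port, and the API key placeholder are illustrative):

    version: "3.7"

    services:
      nginx:
        image: nginx:latest
        labels:
          # Standard autodiscovery labels for the nginx check; %%host%% is
          # the part that fails to resolve under swarm mode. Assumes nginx
          # is configured to serve stub_status on port 81.
          com.datadoghq.ad.check_names: '["nginx"]'
          com.datadoghq.ad.init_configs: '[{}]'
          com.datadoghq.ad.instances: '[{"nginx_status_url": "http://%%host%%:81/nginx_status"}]'

      datadog-agent:
        image: datadog/agent:6.13.0
        environment:
          - DD_API_KEY=<YOUR_API_KEY>
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock:ro
          - /proc/:/host/proc/:ro
          - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
        deploy:
          # one agent per swarm node
          mode: global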

Additional environment details (Operating System, Cloud provider, etc):

judge2020 commented 4 years ago

A temporary workaround I've found: when Datadog is run within the same swarm stack and network as your service, replace %%host%% with the name of the swarm service, e.g. you would use nginx as the hostname for that basic compose file.

      com.datadoghq.ad.instances: '[{"nginx_status_url": "http://nginx:81/nginx_status"}]'

This only works when running a single instance of the service per machine, and when the agent is a global deployment, but that fits my use case for the time being.
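In terms of the compose sketch above, this workaround amounts to hard-coding the swarm service name into the instance URL and running the agent as a global service on the same stack network (service name and port are again illustrative):

      nginx:
        labels:
          # other labels unchanged from the sketch above; swarm's service
          # DNS resolves "nginx" on the shared overlay network, so %%host%%
          # is no longer needed
          com.datadoghq.ad.instances: '[{"nginx_status_url": "http://nginx:81/nginx_status"}]'

      datadog-agent:
        deploy:
          # one agent per node; as noted above, this relies on there being
          # a single nginx task per node
          mode: global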