DataDog / dd-agent

Datadog Agent Version 5
https://docs.datadoghq.com/

Service Discovery Not Working for Kafka Integration #2862

Open paulcichonski opened 7 years ago

paulcichonski commented 7 years ago

_Output of the info page_

2016-09-26 18:31:23,482 | WARNING | dd.collector | utils.service_discovery.config(config.py:32) | No configuration backend provided for service discovery. Only auto config templates will be used.
2016-09-26 18:31:23,483 | DEBUG | dd.collector | utils.proxy(proxy.py:68) | No proxy configured
2016-09-26 18:31:23,590 | DEBUG | dd.collector | docker.auth.auth(auth.py:189) | File doesn't exist
2016-09-26 18:31:23,599 | DEBUG | dd.collector | utils.subprocess_output(subprocess_output.py:63) | Popen(['/bin/hostname', '-f'], close_fds = True, shell = False, stdout = <open file '<fdopen>', mode 'w+b' at 0x7fb8a5897420>, stderr = <open file '<fdopen>', mode 'w+b' at 0x7fb8a5897150>, stdin = None) called
===================
Collector (v 5.8.5)
===================

  Status date: 2016-09-26 18:31:08 (15s ago)
  Pid: 19
  Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.6
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log

  Clocks
  ======

    NTP offset: -0.1651 s
    System UTC time: 2016-09-26 18:31:23.771015

  Paths
  =====

    conf.d: /etc/dd-agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d

  Hostnames
  =========

    socket-hostname: 51d46fad06c3
    hostname: docker-daemon
    socket-fqdn: 51d46fad06c3

  Checks
  ======

    ntp
    ---
      - Collected 0 metrics, 0 events & 1 service check

    disk
    ----
      - instance #0 [OK]
      - Collected 40 metrics, 0 events & 1 service check

    docker_daemon
    -------------
      - instance #0 [OK]
      - Collected 66 metrics, 0 events & 2 service checks

  Emitters
  ========

    - http_emitter [OK]

===================
Dogstatsd (v 5.8.5)
===================

  Status date: 2016-09-26 18:31:16 (7s ago)
  Pid: 15
  Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.6
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log

  Flush count: 15
  Packet Count: 0
  Packets per second: 0.0
  Metric count: 1
  Event count: 0
  Service check count: 0

===================
Forwarder (v 5.8.5)
===================

  Status date: 2016-09-26 18:31:21 (2s ago)
  Pid: 17
  Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.6
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log

  Queue Size: 0 bytes
  Queue Length: 0
  Flush Count: 52
  Transactions received: 22
  Transactions flushed: 22
  Transactions rejected: 0

Additional environment details (Operating System, Cloud provider, etc): I have replicated this using the following local setup:

docker-compose.yml:

dd-agent:
  build: dd-agent
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /proc/:/host/proc/:ro
    - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    - ./auto_conf:/etc/dd-agent/conf.d/auto_conf
  environment:
    - API_KEY
    - SD_BACKEND=docker
    - LOG_LEVEL=DEBUG
  links:
    - kafka
kafka:
  build: kafka
  environment:
    - JMX_PORT=9999
  links:
    - zookeeper
  labels:
    com.datadoghq.sd.check.id: kafka
zookeeper:
  build: zookeeper

./auto_conf/kafka.yaml:

docker_images:
  - kafka

init_config:
  is_jmx: true
  conf:
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec'
        attribute:
          Count:
            metric_type: rate
            alias: kafka.request.fetch.failed.rate
instances:
  - host: "%%host%%"
    port: 9999
    tags:
      kafka: broker
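
For context on the instances section above: `%%host%%` is a service-discovery template variable that the agent's docker backend fills in with the discovered container's address when it builds the check configuration. A minimal sketch of that substitution (the function and variable names here are illustrative, not the agent's actual internals):

```python
import re

# Illustrative stand-in for the agent's template-variable substitution:
# replaces %%host%%-style placeholders with values discovered from the
# running container. Names here are hypothetical, not dd-agent code.
TEMPLATE_VAR = re.compile(r"%%(\w+)%%")

def render_instance(instance, discovered):
    """Replace %%var%% placeholders in string values with discovered data."""
    rendered = {}
    for key, value in instance.items():
        if isinstance(value, str):
            rendered[key] = TEMPLATE_VAR.sub(
                lambda m: str(discovered[m.group(1)]), value)
        else:
            rendered[key] = value
    return rendered

# Example: the kafka template instance with a discovered container IP.
instance = {"host": "%%host%%", "port": 9999, "tags": {"kafka": "broker"}}
print(render_instance(instance, {"host": "172.17.0.3"}))
```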

./dd-agent/Dockerfile:

FROM datadog/docker-dd-agent:latest

# Install JMXFetch dependencies
RUN apt-get update \
&& apt-get install openjdk-7-jre-headless -qq --no-install-recommends

./kafka/Dockerfile:

FROM java:jre

RUN apt-get update

ENV KAFKA_VERSION 0.10.0.1
RUN wget -q -O - http://www-us.apache.org/dist/kafka/$KAFKA_VERSION/kafka_2.11-$KAFKA_VERSION.tgz | tar -zxf - -C /opt
RUN ln -s /opt/kafka_2.11-$KAFKA_VERSION /opt/kafka

VOLUME ["/tmp/kafka-logs"]
EXPOSE 9092

## toy example
RUN echo "zookeeper.connect=zookeeper" >> /opt/kafka/config/server.properties
RUN echo "broker.id=1" >> /opt/kafka/config/server.properties

WORKDIR /opt/kafka
ENTRYPOINT ["bin/kafka-server-start.sh", "config/server.properties"]

./zookeeper/Dockerfile:

FROM java:jre

RUN apt-get update

ENV ZOOKEEPER_VERSION 3.4.8
RUN wget -q -O - http://www.us.apache.org/dist/zookeeper/zookeeper-$ZOOKEEPER_VERSION/zookeeper-$ZOOKEEPER_VERSION.tar.gz | tar -xzf - -C /opt \
 && cd /opt/zookeeper-$ZOOKEEPER_VERSION
RUN ln -s /opt/zookeeper-$ZOOKEEPER_VERSION /opt/zookeeper

RUN mkdir -p /var/lib/zookeeper
VOLUME ["/var/lib/zookeeper"]

EXPOSE 2181 2888 3888

COPY zoo.cfg /opt/zookeeper/conf/
ENTRYPOINT ["/opt/zookeeper/bin/zkServer.sh", "start-foreground"]

./zookeeper/zoo.cfg:

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2

Steps to reproduce the issue:

  1. API_KEY=<dd-token> docker-compose up -d dd-agent

Describe the results you received: The Kafka check is not configured.

Describe the results you expected: The Kafka check should be configured and healthy.

Additional information you deem important (e.g. issue happens only occasionally): Here is the output of `service datadog-agent configcheck` from within the container:

2016-09-26 18:37:20,174 | WARNING | dd.collector | utils.service_discovery.config(config.py:32) | No configuration backend provided for service discovery. Only auto config templates will be used.
2016-09-26 18:37:20,175 | DEBUG | dd.collector | utils.proxy(proxy.py:68) | No proxy configured
2016-09-26 18:37:20,181 | DEBUG | dd.collector | docker.auth.auth(auth.py:189) | File doesn't exist
2016-09-26 18:37:20,192 | DEBUG | dd.collector | utils.subprocess_output(subprocess_output.py:63) | Popen(['/bin/hostname', '-f'], close_fds = True, shell = False, stdout = <open file '<fdopen>', mode 'w+b' at 0x7fc653e28270>, stderr = <open file '<fdopen>', mode 'w+b' at 0x7fc653e28150>, stdin = None) called
docker_daemon.yaml is valid
All yaml files passed. You can now run the Datadog agent.

Loading check configurations...

2016-09-26 18:37:20,260 | DEBUG | dd.collector | utils.subprocess_output(subprocess_output.py:63) | Popen(['/bin/hostname', '-f'], close_fds = True, shell = False, stdout = <open file '<fdopen>', mode 'w+b' at 0x7fc653e28150>, stderr = <open file '<fdopen>', mode 'w+b' at 0x7fc653e28270>, stdin = None) called
2016-09-26 18:37:20,327 | DEBUG | dd.collector | config(config.py:867) | No sdk integrations path found
2016-09-26 18:37:20,330 | DEBUG | dd.collector | docker.auth.auth(auth.py:189) | File doesn't exist
2016-09-26 18:37:20,333 | DEBUG | dd.collector | config(config.py:952) | Loaded /opt/datadog-agent/agent/checks.d/docker_daemon.py
2016-09-26 18:37:20,334 | DEBUG | dd.collector | config(config.py:952) | Loaded /opt/datadog-agent/agent/checks.d/ntp.py
2016-09-26 18:37:20,335 | DEBUG | dd.collector | config(config.py:952) | Loaded /opt/datadog-agent/agent/checks.d/agent_metrics.py
2016-09-26 18:37:20,336 | WARNING | dd.collector | checks.disk(disk.py:80) | Using `use_mount` in datadog.conf has been deprecated in favor of `use_mount` in disk.yaml
2016-09-26 18:37:20,337 | DEBUG | dd.collector | config(config.py:952) | Loaded /opt/datadog-agent/agent/checks.d/disk.py
2016-09-26 18:37:20,337 | INFO | dd.collector | config(config.py:834) | Fetching service discovery check configurations.
2016-09-26 18:37:20,337 | DEBUG | dd.collector | docker.auth.auth(auth.py:189) | File doesn't exist
2016-09-26 18:37:20,345 | WARNING | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:289) | No supported configuration backend was provided, using auto-config only.
2016-09-26 18:37:20,345 | DEBUG | dd.collector | utils.service_discovery.abstract_config_store(abstract_config_store.py:98) | No auto config was found for image ddtestjmx_dd-agent, leaving it alone.
2016-09-26 18:37:20,345 | DEBUG | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:260) | No config template for container 2c75bdb0f156 with identifier ddtestjmx_dd-agent. It will be left unconfigured.
2016-09-26 18:37:20,347 | WARNING | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:289) | No supported configuration backend was provided, using auto-config only.
2016-09-26 18:37:20,348 | DEBUG | dd.collector | config(config.py:867) | No sdk integrations path found
2016-09-26 18:37:20,348 | WARNING | dd.collector | utils.checkfiles(checkfiles.py:49) | Failed to load the check class for kafka.
2016-09-26 18:37:20,348 | INFO | dd.collector | utils.service_discovery.abstract_config_store(abstract_config_store.py:73) | Could not find an auto configuration template for kafka. Leaving it unconfigured.
2016-09-26 18:37:20,348 | DEBUG | dd.collector | utils.service_discovery.abstract_config_store(abstract_config_store.py:98) | No auto config was found for image kafka, leaving it alone.
2016-09-26 18:37:20,348 | DEBUG | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:260) | No config template for container 0f0b9b5c21be with identifier kafka. It will be left unconfigured.
2016-09-26 18:37:20,350 | WARNING | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:289) | No supported configuration backend was provided, using auto-config only.
2016-09-26 18:37:20,350 | DEBUG | dd.collector | utils.service_discovery.abstract_config_store(abstract_config_store.py:98) | No auto config was found for image ddtestjmx_zookeeper, leaving it alone.
2016-09-26 18:37:20,351 | DEBUG | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:260) | No config template for container 6fee9e605478 with identifier ddtestjmx_zookeeper. It will be left unconfigured.
2016-09-26 18:37:20,351 | INFO | dd.collector | config(config.py:1024) | initialized checks.d checks: ['ntp', 'disk', 'docker_daemon']
2016-09-26 18:37:20,351 | INFO | dd.collector | config(config.py:1025) | initialization failed checks.d checks: []

Source of the configuration objects built by the agent:

Check "agent_metrics":
  source --> YAML file
  config --> {'instances': [{}], 'init_config': {'process_metrics': [{'active': True, 'type': 'gauge', 'name': 'memory_info'}, {'active': True, 'type': 'rate', 'name': 'io_counters'}, {'active': True, 'type': 'gauge', 'name': 'num_threads'}, {'active': False, 'type': 'gauge', 'name': 'connections'}]}}

Check "ntp":
  source --> YAML file
  config --> {'instances': [{'offset_threshold': 60}], 'init_config': None}

Check "disk":
  source --> YAML file
  config --> {'instances': [{'use_mount': False}], 'init_config': None}

Check "docker_daemon":
  source --> YAML file
  config --> {'instances': [{'url': 'unix://var/run/docker.sock'}], 'init_config': {'docker_root': '/host'}}

2016-09-26 18:37:20,353 | DEBUG | dd.collector | docker.auth.auth(auth.py:189) | File doesn't exist

Containers info:

Number of containers found: 3
    - ID: 2c75bdb0f156 image: ddtestjmx_dd-agent name: ddtestjmx_dd-agent_1
    - ID: 0f0b9b5c21be image: ddtestjmx_kafka name: ddtestjmx_kafka_1
    - ID: 6fee9e605478 image: ddtestjmx_zookeeper name: ddtestjmx_zookeeper_1

The key line being:

2016-09-26 18:37:20,348 | WARNING | dd.collector | utils.checkfiles(checkfiles.py:49) | Failed to load the check class for kafka.

I'm guessing this is because, under the covers, the kafka check uses https://github.com/DataDog/dd-agent/blob/master/jmxfetch.py#L165 to configure the JMX metric collection, and service discovery doesn't seem to support that yet (just a guess, though).
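
That guess is consistent with the config shown above: JMX-based checks mark themselves with `is_jmx: true` in `init_config`, and the collector hands those off to JMXFetch instead of loading a Python check class itself, which is why the class lookup fails. A rough sketch of that routing decision (simplified and hypothetical; the agent also keeps its own hard-coded list of JMX check names, which is longer than the one here):

```python
def is_jmx_check(init_config, check_name):
    """Decide whether a check config belongs to JMXFetch rather than the
    Python collector. Simplified sketch, not the agent's actual code."""
    # Hypothetical list for illustration; the agent's own list is longer.
    KNOWN_JMX_CHECKS = {"kafka", "cassandra", "jmx", "tomcat", "activemq"}
    init_config = init_config or {}
    return bool(init_config.get("is_jmx")) or check_name in KNOWN_JMX_CHECKS

# The kafka auto_conf template is flagged as JMX, so the collector finds
# no Python check class to load -- hence the warning in the log above.
print(is_jmx_check({"is_jmx": True}, "kafka"))   # True
```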

Any ideas how to make this work with the service discovery mechanism?

Note that I've verified this works fine if I change my setup to use the static /conf.d mechanism:

docker-compose.yml:

dd-agent:
  build: dd-agent
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /proc/:/host/proc/:ro
    - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    - ./auto_conf:/conf.d
  environment:
    - API_KEY
    - SD_BACKEND=docker
    - LOG_LEVEL=DEBUG
  links:
    - kafka
kafka:
  build: kafka
  environment:
    - JMX_PORT=9999
  links:
    - zookeeper
zookeeper:
  build: zookeeper

./auto_conf/kafka.yaml:

init_config:
  is_jmx: true
  conf:
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec'
        attribute:
          Count:
            metric_type: rate
            alias: kafka.request.fetch.failed.rate
instances:
  - host: kafka
    port: 9999
    tags:
      kafka: broker
paulcichonski commented 7 years ago

Note: I get the same "Failed to load the check class for" log line when trying to use service discovery for the cassandra.yaml integration. I'm guessing that's also related to the JMX usage.

I can post more details of my cassandra setup if it helps (looks very similar to above).

Thanks!

mikekap commented 7 years ago

I think the core issue is that this is simply unimplemented. Kafka monitoring is a JMX check, which uses a standalone jmxfetch binary (rather than the collector). That binary only looks in conf.d and ignores any service-discovery settings.

For now, I'm monitoring Kafka via packaging up the jmxfetch jar into a docker container and running that as a sidecar to kafka.
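
For anyone trying the same workaround, a hedged sketch of what that sidecar could look like in compose syntax (the `jmxfetch` image is hypothetical, and the JMXFetch command-line arguments are from memory and may differ between versions):

```yaml
# Hypothetical sidecar: a container holding the JMXFetch jar plus a
# conf.d/kafka.yaml like the one above, pointed at the kafka broker.
jmxfetch:
  build: jmxfetch          # image bundling the jmxfetch jar and a conf.d dir
  command: >
    java -jar /jmxfetch.jar
    --conf_directory /conf.d
    --reporter statsd:dd-agent:8125
    collect
  links:
    - kafka                # resolves the host used in conf.d/kafka.yaml
    - dd-agent             # dogstatsd endpoint that receives the metrics
```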

paulcichonski commented 7 years ago

Thanks @mikekap, I did not realize that jmxfetch was a separate binary. I will experiment with the sidecar approach.

hkaj commented 7 years ago

Hi @paulcichonski, as @mikekap mentioned, this is not implemented yet; sorry for the confusion. I'll treat this issue as a feature request, though (it's not the first time it has been asked for). Stay tuned :)