DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Datadog agent health check doesn't test docker.sock in kubernetes for availability #4787

Open toha-tk opened 4 years ago

toha-tk commented 4 years ago

Describe the bug

The official docs suggest running Datadog as a DaemonSet, and that is the right way to do it. But after a new node is created, the Datadog agent sometimes initializes without a working Docker socket. Sometimes the connection to the socket also breaks later, but the agent health check script doesn't test for it. Everything works as expected only after the pod is recreated.

The only thing visible in the Datadog agent logs: Could not setup the docker launcher: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

As a result, I've created a separate image that tests docker.sock via socat, but I'd like to remove that hack:

socat -T1 - UNIX-CLIENT:/var/run/docker.sock

Whenever there is no connection to the Docker socket, no logs are collected, and that becomes a real problem.

To Reproduce

I was not able to find the initial cause of the docker.sock failure. Another way to reproduce: deploy the Datadog agent without the dockersocket volumeMount.

Expected behavior

The Datadog agent health check should test not only the internal services, but also whether the agent can get information from Docker:

/opt/datadog-agent/bin/agent/agent health

Currently it tests only the components, but an incorrectly mounted /var/run/docker.sock still prevents log collection.
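To illustrate the requested behavior, here is a minimal sketch of a combined check that runs a health command and then verifies that the socket path exists and is a Unix socket. The function name `combined_health` and its argument order are assumptions made for this example; nothing here is part of the agent itself.

```shell
#!/bin/sh
# Sketch only: combined_health is a hypothetical helper, not an agent command.
# Usage: combined_health SOCKET_PATH HEALTH_COMMAND [ARGS...]
combined_health() {
    sock="$1"
    shift

    # Step 1: run the existing component health check.
    if ! "$@" >/dev/null 2>&1; then
        echo "FAIL: agent health reported unhealthy components" >&2
        return 1
    fi

    # Step 2: the path must exist and be a Unix socket.
    # This catches a missing mount, not a daemon that has stopped answering.
    if [ ! -S "$sock" ]; then
        echo "FAIL: $sock is missing or not a socket" >&2
        return 1
    fi

    echo "PASS"
}
```

It could be called as `combined_health /var/run/docker.sock /opt/datadog-agent/bin/agent/agent health`. A file-type test like this only covers the missing volumeMount case; detecting a broken connection needs an actual connect attempt.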

Output examples

# /opt/datadog-agent/bin/agent/agent health
Agent health: PASS
=== 15 healthy components ===
ad-config-provider-kubernetes, ad-kubeletlistener, ad-servicelistening, aggregator, collector-queue, collector-queue, dogstatsd-main, forwarder, healthcheck, logs-agent, metadata-agent_checks, metadata-host, metadata-inventories, metadata-resources, tagger

# socat -T1 - UNIX-CLIENT:/var/run/docker.sock
2020/01/27 14:44:10 socat[734] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory

Environment (please complete the following information):

plumdog commented 4 years ago

I also hit this same failure (Kubernetes 1.13 on AWS EKS). I was also unable to identify the cause, so I'm not sure what the underlying failure is, but I agree that the Datadog agent shouldn't pass the health check if it can't connect to Docker, rather than just logging that message.

toha-tk commented 4 years ago

@plumdog Mate, the issue is easily solved by increasing the Datadog resource limits. I was not able to find the details, but the agent seems to have a limit on the number of containers it can successfully service.

If that workaround doesn't help you, you can build your own Datadog Agent image whose health check also tests docker.sock.

Dockerfile


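# Install socat so the probe can test the Docker socket
RUN apt-get update && \
    apt-get install -y socat && \
    rm -rf /var/lib/apt/lists/*

# Append the socket test to the existing probe script
COPY socket-test.sh .
RUN cat socket-test.sh >> ./probe.sh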

socket-test.sh

socat -T1 - UNIX-CLIENT:/var/run/docker.sock
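A slightly more defensive variant of the same test might look like the sketch below. It wraps the socat call in a function that reports why it failed. The name `test_docker_sock` and the exit codes are illustrative assumptions, not something the agent image provides.

```shell
#!/bin/sh
# Sketch only: test_docker_sock is a hypothetical helper name.
# Returns 0 if a connection to the socket succeeds, 1 if it fails,
# and 2 if socat is not available in the image.
test_docker_sock() {
    sock="${1:-/var/run/docker.sock}"

    if ! command -v socat >/dev/null 2>&1; then
        echo "socat is not installed" >&2
        return 2
    fi

    # -T1 makes socat give up after one second of inactivity.
    # stdin comes from /dev/null so the probe never waits for input.
    if ! socat -T1 - "UNIX-CLIENT:$sock" </dev/null >/dev/null 2>&1; then
        echo "cannot connect to $sock" >&2
        return 1
    fi
}
```

If this were appended to probe.sh instead of the one-liner, a trailing line such as `test_docker_sock || exit 1` would make the probe fail whenever the socket cannot be reached.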