
collector-queue becomes unhealthy after 1 minute #1487

Closed: mhaamann closed this issue 5 years ago

mhaamann commented 6 years ago

I experienced the same issue as reported in https://github.com/DataDog/datadog-agent/issues/1469. I updated the Datadog Agent from 5 to 6 via the Datadog Ansible role and changed the Docker conf file from docker_daemon.yaml to docker.yaml, which removed the errors.

But now nothing is displayed in the Datadog web application. How do I get it to work again?

Output from datadog-agent configcheck

--- docker check ---
Instance 1:
new_tag_names: true
url: unix://var/run/docker.sock
~
---

Update:

I changed log_level to DEBUG in datadog.yaml and restarted the datadog-agent. After a few seconds the health looks good:

datadog-agent health
Agent health: PASS
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker

But after approximately a minute the Agent becomes unhealthy; running datadog-agent health returns:

Agent health: FAIL
=== 12 healthy components ===
ad-autoconfig, ad-configresolver, aggregator, dockerutil-event-dispatch, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
collector-queue
Error: found 1 unhealthy components

At the same moment agent.log shows:

2018-03-20 15:33:13 CET | DEBUG | (kubelet.go:445 in init) | Cannot connect: Get http://docker03:10255/: dial tcp [2a01:4f8:10b:2944::2]:10255: getsockopt: connection refused
2018-03-20 15:33:13 CET | DEBUG | (kubelet.go:87 in GetKubeUtil) | Init error: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://docker03:10250/: dial tcp [2a01:4f8:10b:2944::2]:10250: getsockopt: connection refused", http: "Get http://docker03:10255/: dial tcp [2a01:4f8:10b:2944::2]:10255: getsockopt: connection refused"
2018-03-20 15:33:13 CET | DEBUG | (tagger.go:145 in tryCollectors) | will retry kubelet later: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://docker03:10250/: dial tcp [2a01:4f8:10b:2944::2]:10250: getsockopt: connection refused", http: "Get http://docker03:10255/: dial tcp [2a01:4f8:10b:2944::2]:10255: getsockopt: connection refused"
2018-03-20 15:33:13 CET | DEBUG | (kubelet.go:87 in GetKubeUtil) | Init error: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2018-03-20 15:33:13 CET | DEBUG | (tagger.go:145 in tryCollectors) | will retry kube-service-collector later: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2018-03-20 15:33:14 CET | DEBUG | (aggregator.go:352 in func1) | Flushing 1 service checks to the forwarder
2018-03-20 15:33:14 CET | DEBUG | (aggregator.go:310 in func1) | Flushing 19 series to the forwarder
2018-03-20 15:33:14 CET | DEBUG | (split.go:52 in Payloads) | The payload was not too big, returning the full payload
2018-03-20 15:33:14 CET | DEBUG | (split.go:52 in Payloads) | The payload was not too big, returning the full payload
2018-03-20 15:33:14 CET | DEBUG | (transaction.go:142 in Process) | Successfully posted payload to "https://6-0-3-app.agent.datadoghq.com/api/v1/check_run?api_key=*************************45bde": {"status": "ok"}
2018-03-20 15:33:14 CET | DEBUG | (transaction.go:142 in Process) | Successfully posted payload to "https://6-0-3-app.agent.datadoghq.com/api/v1/series?api_key=*************************45bde": {"status": "ok"}

Environment information:
Docker: 17.12.1-ce
Kernel Version: 4.15.6-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)

Result

No information from this server is displayed in the Datadog web interface.

mhaamann commented 6 years ago

@mfpierre, could you please have another look at this issue? I have collected more information and clarified the description, but I have not been able to resolve it.

mfpierre commented 6 years ago

Hey @mhaamann, we're looking into the issue. One way we were able to reproduce it was with checks that take a long time to run. Do you have a significant number of checks, or can you spot any checks in the logs that take a significant amount of time to run?

mfpierre commented 6 years ago

Thanks @mhaamann for the investigation. We're creating an issue on our end to optimize health_service_check_whitelist, which currently creates some load by inspecting a bunch of containers and is probably causing the collector-queue to pile up and eventually become unhealthy.

We'll also be investigating a better way to handle the health of the collector-queue, which can currently become unhealthy because of individual checks taking a long time to run.
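
To illustrate why a single stuck check flips the whole component to unhealthy, here is a rough Go sketch of the general watchdog pattern (an illustration only, not the agent's actual implementation): the queue worker pings a health monitor between checks, and a check that never returns starves those pings.

package main

import (
	"fmt"
	"time"
)

// runNextCheck stands in for the next scheduled check; a hanging
// docker inspect would block here indefinitely.
func runNextCheck() { time.Sleep(2 * time.Second) }

func main() {
	ping := make(chan struct{}, 1)

	// Watchdog: if no ping arrives within 30 seconds, report the
	// component as unhealthy (the threshold is illustrative).
	go func() {
		for {
			select {
			case <-ping:
				// pinged in time, keep waiting
			case <-time.After(30 * time.Second):
				fmt.Println("collector-queue: unhealthy (no ping received)")
				return
			}
		}
	}()

	// Worker loop: it only pings between checks, so a check that hangs
	// forever means the watchdog deadline is eventually missed.
	for {
		runNextCheck()
		select {
		case ping <- struct{}{}:
		default:
		}
	}
}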

mhaamann commented 6 years ago

Sorry for removing the comment. Right after I posted it, the collector-queue failed again.

mhaamann commented 6 years ago

Even with the vanilla docker.yaml file, the collector-queue fails.

docker.yaml:

init_config: null
instances:
-   url: unix://var/run/docker.sock

At least it's now very clear that the offending file is somehow docker.yaml; none of the other checks influence anything.

mfpierre commented 6 years ago

@mhaamann could you start the agent with a debug log_level and see how much time the docker check takes to run? (i.e. the time between 2018-03-21 10:29:25 UTC | DEBUG | (runner.go:248 in work) | Running check docker and 2018-03-21 10:29:25 UTC | DEBUG | (runner.go:304 in work) | Done running check docker)

mhaamann commented 6 years ago

I have added the log here, starting from Running check docker: https://gist.github.com/mhaamann/33ab70de931639e9c0e1717d90e8619f

It never gets to the point of Done running check docker.

mfpierre commented 6 years ago

@mhaamann could you send the output of curl http://localhost:5000/debug/pprof/goroutine?debug=2 to investigate where the check is stuck (see https://github.com/DataDog/datadog-agent/blob/master/docs/dev/tools.md)
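
For context, the endpoint above is Go's standard net/http/pprof facility, which the agent serves on its debug port; a minimal sketch of how such an endpoint is typically wired up (the listen address mirrors the curl command above; the agent's actual setup may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/goroutine, /heap, etc. on the default mux
)

func main() {
	// With the default mux served, GET /debug/pprof/goroutine?debug=2
	// returns a plain-text dump of every goroutine's stack.
	log.Fatal(http.ListenAndServe("localhost:5000", nil))
}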

mhaamann commented 6 years ago

The output is here: https://gist.github.com/mhaamann/d20c074a1ee24f26bf8d842e60f789d8 I am not sure what to look for.

mfpierre commented 6 years ago

@mhaamann it seems there's an issue with Docker: we're making a docker inspect call that never comes back (we currently don't have a timeout on Docker calls; we'll probably add one):

goroutine 112 [select, 30 minutes]:
net/http.(*persistConn).roundTrip(0xc42007de60, 0xc420883320, 0x0, 0x0, 0x0)
    /root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/transport.go:1970 +0x608
net/http.(*Transport).RoundTrip(0xc420204a50, 0xc420585600, 0xc420204a50, 0x0, 0x0)
    /root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/transport.go:413 +0x992
net/http.send(0xc420585600, 0x3003200, 0xc420204a50, 0x0, 0x0, 0x0, 0xc42000eae8, 0x57, 0xc4204cae00, 0x1)
    /root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/client.go:249 +0x1a9
net/http.(*Client).send(0xc420445080, 0xc420585600, 0x0, 0x0, 0x0, 0xc42000eae8, 0x0, 0x1, 0x1)
    /root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/client.go:173 +0xfd
net/http.(*Client).Do(0xc420445080, 0xc420585600, 0xc420014048, 0xc420585600, 0x57)
    /root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/client.go:602 +0x28d
github.com/DataDog/datadog-agent/vendor/golang.org/x/net/context/ctxhttp.Do(0x3022bc0, 0xc420014048, 0xc420445080, 0xc420585500, 0x0, 0x0, 0x0)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/golang.org/x/net/context/ctxhttp/ctxhttp.go:30 +0x6e
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).doRequest(0xc42047e000, 0x3022bc0, 0xc420014048, 0xc420585500, 0x57, 0x0, 0x0, 0x0, 0xc420585500, 0x0)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/request.go:127 +0x6e
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).sendRequest(0xc42047e000, 0x3022bc0, 0xc420014048, 0x1f2bd49, 0x3, 0xc420b102a0, 0x51, 0x0, 0x0, 0x0, ...)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/request.go:121 +0x13e
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).get(0xc42047e000, 0x3022bc0, 0xc420014048, 0xc420b102a0, 0x51, 0x0, 0x0, 0xc420b102a0, 0x51, 0x0, ...)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/request.go:36 +0xb5
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).ContainerInspect(0xc42047e000, 0x3022bc0, 0xc420014048, 0xc420261280, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/container_inspect.go:16 +0xf3
github.com/DataDog/datadog-agent/pkg/util/docker.(*DockerUtil).dockerContainers(0xc4202d20c0, 0xc4204cba8e, 0x0, 0x0, 0x0, 0xc420264100, 0x2f)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/util/docker/docker_util.go:174 +0x9d3
github.com/DataDog/datadog-agent/pkg/util/docker.(*DockerUtil).Containers(0xc4202d20c0, 0xc4204cba8e, 0x0, 0x0, 0x0, 0x2, 0x2)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/util/docker/docker_util.go:250 +0x11ed
github.com/DataDog/datadog-agent/pkg/collector/corechecks/containers.(*DockerCheck).Run(0xc4200b9dc0, 0x3c4052046, 0x3085b60)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/corechecks/containers/docker.go:156 +0x246

Do you have trouble with general Docker usage on the host, e.g. docker ps or docker inspect taking a long time to return?
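
For reference, a minimal sketch of how the timeout mentioned above could look, wrapping the inspect call in context.WithTimeout so a wedged daemon returns an error instead of hanging the check; the helper name and the 5-second value are illustrative assumptions, not the agent's actual patch.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// inspectWithTimeout bounds ContainerInspect to five seconds, so a daemon
// that never answers yields context.DeadlineExceeded instead of blocking
// the collector-queue forever.
func inspectWithTimeout(cli *client.Client, id string) (types.ContainerJSON, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return cli.ContainerInspect(ctx, id)
}

func main() {
	cli, err := client.NewEnvClient()
	if err != nil {
		log.Fatal(err)
	}
	// "some-container-id" is a placeholder; against a wedged daemon the
	// call now fails after 5 seconds rather than hanging for 30+ minutes.
	info, err := inspectWithTimeout(cli, "some-container-id")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info.Name)
}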

mhaamann commented 6 years ago

docker ps works fine; the host is only running 20 containers. You are right that docker inspect seems to be unresponsive for 2 containers...

mfpierre commented 6 years ago

@mhaamann so it's definitely a bug on Docker's side, but we'll try to mitigate it with a timeout on our side to avoid having the docker check hang forever.

Anything special about these containers so we can reproduce & document the issue?

mhaamann commented 6 years ago

I am guessing that the bug on Docker's side is related to https://github.com/moby/moby/issues/35933. Anyway, a timeout in Datadog seems reasonable.

Datadog not working actually helped us track down this Docker issue; we had been struggling to find a way to detect it.

renan commented 6 years ago

I am having a similar issue with collector-queue going unhealthy after varying periods of time. I don't have a Docker check so I doubt this is related to Docker.

mhaamann commented 6 years ago

If I kill the 2-3 zombie processes I have using kill [process_id] (since docker kill is not responding) and make sure that all containers respond to docker inspect, then the collector-queue works fine for me. At least it has been running for ~30 minutes now.

@renan maybe you also have a check that hangs or times out somewhere? I suggest doing the same as I did and removing one check at a time to find the culprit.

renan commented 6 years ago

@mhaamann Thanks, I will look into that, especially since I have a custom check that collects quite a few metrics.

Meanwhile, I have looked at the pprof output both while everything was healthy and while the component was unhealthy. The only real difference between the two is this:

goroutine 117 [syscall, 25 minutes, locked to thread]:
github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python._Cfunc__gopy_PyObject_CallMethod(0x7fdb88992050, 0x7fdb5c001270, 0xc400000001, 0x7fdb5c001290, 0xc420082078, 0x0)
    github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python/_obj/_cgo_gotypes.go:5465 +0x4e
github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python.(*PyObject).CallMethod.func5(0xc420082078, 0x1, 0x1, 0x7fdb88992050, 0x7fdb5c001270, 0x1, 0x7fdb5c001290, 0xc420082078, 0x8)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python/object.go:407 +0x144
github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python.(*PyObject).CallMethod(0xc4204f29a0, 0x1f2f0eb, 0x3, 0xc420815df0, 0x1, 0x1, 0x0)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python/object.go:402 +0x5dd
github.com/DataDog/datadog-agent/pkg/collector/py.(*PythonCheck).Run(0xc4200138c0, 0x0, 0x0)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/py/check.go:65 +0x21f
github.com/DataDog/datadog-agent/pkg/collector/runner.(*Runner).work(0xc420476b40)
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/runner/runner.go:255 +0x4a1
created by github.com/DataDog/datadog-agent/pkg/collector/runner.(*Runner).AddWorker
    /builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/runner/runner.go:100 +0x8a

Here's the complete output: https://gist.github.com/renan/d722573c19c1a99150aa7f8451de840c

olivielpeau commented 6 years ago

Hi @renan, the pprof output does point to a Python check taking a very long time to run (at the time of the profile, the check had been running its current iteration for the past 25 minutes). I'd recommend adding logging to your custom check to identify where it's hanging or what's taking so long.

Alternatively (or on top of this), if you'd like the other checks to continue running even when your custom check is stuck, you can increase the number of check runners in datadog.yaml (default: 1), for instance to 2 (see https://github.com/DataDog/datadog-agent/blob/6.0.3/pkg/config/config_template.yaml#L92-L96).

renan commented 6 years ago

Thanks @olivielpeau, that helped. I was able to track it down to the StatsD check getting stuck and created a PR to fix it: https://github.com/DataDog/integrations-core/pull/1363 👍

olivielpeau commented 5 years ago

Closing this issue, it should be fixed now.