Closed: mhaamann closed this issue 5 years ago
@mfpierre could you please have another look at the issue. I collected more information and clarified my issue. I have not been able to resolve it.
Hey @mhaamann we're looking into the issue, one way we were able to reproduce it was to have checks taking a long time to run. Do you have a significant number of checks or can you spot in the logs some checks that are taking a significant amount of time to run?
Thanks @mhaamann for the investigation. We're creating an issue on our end to optimize `health_service_check_whitelist`, which currently creates some load by inspecting a bunch of containers, probably causing the collector-queue to pile up and eventually become unhealthy.
We'll also be investigating a better way to handle the health of the collector-queue, which can currently become unhealthy because of individual checks taking a long time to run.
Sorry for removing the comment. Right after I posted the collector-queue failed again.
Even with the vanilla docker.yaml file the collector-queue fails.
docker.yaml:

```yaml
init_config: null
instances:
- url: unix://var/run/docker.sock
```
At least it's now very clear that the offending file is somehow docker.yaml. None of the other checks influence anything.
@mhaamann could you start the agent with a debug `log_level` and see how much time the docker check takes to run? (the time between the `(runner.go:248 in work) | Running check docker` and `(runner.go:304 in work) | Done running check docker` log lines)
I have added the log here, starting from `Running check docker`: https://gist.github.com/mhaamann/33ab70de931639e9c0e1717d90e8619f
It never gets to the point of logging `Done running check docker`.
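For reference, the elapsed time between the two DEBUG lines can be computed with a short script; a minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS UTC` timestamp format shown in the log lines above (the sample lines here are illustrative):

```python
from datetime import datetime

def parse_stamp(line):
    # The agent log prefix looks like: "2018-03-21 10:29:25 UTC | DEBUG | ..."
    return datetime.strptime(line.split(" UTC ")[0], "%Y-%m-%d %H:%M:%S")

def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two agent log lines."""
    return (parse_stamp(end_line) - parse_stamp(start_line)).total_seconds()

start = "2018-03-21 10:29:25 UTC | DEBUG | (runner.go:248 in work) | Running check docker"
end = "2018-03-21 10:29:55 UTC | DEBUG | (runner.go:304 in work) | Done running check docker"
print(elapsed_seconds(start, end))  # 30.0
```

If the `Done running check docker` line never appears, as in this case, the check never returned at all.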
@mhaamann could you send the output of `curl http://localhost:5000/debug/pprof/goroutine?debug=2` so we can investigate where the check is stuck? (see https://github.com/DataDog/datadog-agent/blob/master/docs/dev/tools.md)
The output is here: https://gist.github.com/mhaamann/d20c074a1ee24f26bf8d842e60f789d8 I am not sure what to look for.
@mhaamann it seems there's an issue with Docker: we're making a `docker inspect` call that seems to never come back (we currently don't have a timeout on Docker calls, we'll probably add one):
```
goroutine 112 [select, 30 minutes]:
net/http.(*persistConn).roundTrip(0xc42007de60, 0xc420883320, 0x0, 0x0, 0x0)
	/root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/transport.go:1970 +0x608
net/http.(*Transport).RoundTrip(0xc420204a50, 0xc420585600, 0xc420204a50, 0x0, 0x0)
	/root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/transport.go:413 +0x992
net/http.send(0xc420585600, 0x3003200, 0xc420204a50, 0x0, 0x0, 0x0, 0xc42000eae8, 0x57, 0xc4204cae00, 0x1)
	/root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/client.go:249 +0x1a9
net/http.(*Client).send(0xc420445080, 0xc420585600, 0x0, 0x0, 0x0, 0xc42000eae8, 0x0, 0x1, 0x1)
	/root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/client.go:173 +0xfd
net/http.(*Client).Do(0xc420445080, 0xc420585600, 0xc420014048, 0xc420585600, 0x57)
	/root/.gimme/versions/go1.9.4.linux.amd64/src/net/http/client.go:602 +0x28d
github.com/DataDog/datadog-agent/vendor/golang.org/x/net/context/ctxhttp.Do(0x3022bc0, 0xc420014048, 0xc420445080, 0xc420585500, 0x0, 0x0, 0x0)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/golang.org/x/net/context/ctxhttp/ctxhttp.go:30 +0x6e
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).doRequest(0xc42047e000, 0x3022bc0, 0xc420014048, 0xc420585500, 0x57, 0x0, 0x0, 0x0, 0xc420585500, 0x0)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/request.go:127 +0x6e
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).sendRequest(0xc42047e000, 0x3022bc0, 0xc420014048, 0x1f2bd49, 0x3, 0xc420b102a0, 0x51, 0x0, 0x0, 0x0, ...)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/request.go:121 +0x13e
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).get(0xc42047e000, 0x3022bc0, 0xc420014048, 0xc420b102a0, 0x51, 0x0, 0x0, 0xc420b102a0, 0x51, 0x0, ...)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/request.go:36 +0xb5
github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client.(*Client).ContainerInspect(0xc42047e000, 0x3022bc0, 0xc420014048, 0xc420261280, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/docker/docker/client/container_inspect.go:16 +0xf3
github.com/DataDog/datadog-agent/pkg/util/docker.(*DockerUtil).dockerContainers(0xc4202d20c0, 0xc4204cba8e, 0x0, 0x0, 0x0, 0xc420264100, 0x2f)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/util/docker/docker_util.go:174 +0x9d3
github.com/DataDog/datadog-agent/pkg/util/docker.(*DockerUtil).Containers(0xc4202d20c0, 0xc4204cba8e, 0x0, 0x0, 0x0, 0x2, 0x2)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/util/docker/docker_util.go:250 +0x11ed
github.com/DataDog/datadog-agent/pkg/collector/corechecks/containers.(*DockerCheck).Run(0xc4200b9dc0, 0x3c4052046, 0x3085b60)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/corechecks/containers/docker.go:156 +0x246
```
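The mitigation the maintainers describe is a timeout around the blocking Docker API call. The agent itself is written in Go, but the general pattern can be sketched in Python with `concurrent.futures` (this is an illustration of the technique only, not the agent's implementation):

```python
import concurrent.futures
import time

# A shared pool; Future.result(timeout=...) raises
# concurrent.futures.TimeoutError instead of blocking the caller forever.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout, *args):
    """Run fn(*args), giving up after `timeout` seconds.

    Note: a hung worker thread keeps running in the background; this only
    unblocks the caller, which is enough to keep the check runner healthy.
    """
    return _pool.submit(fn, *args).result(timeout=timeout)

# A call that takes too long (stands in for a stuck ContainerInspect):
try:
    call_with_timeout(time.sleep, 0.2, 1.0)
except concurrent.futures.TimeoutError:
    print("inspect timed out, moving on")
```

This is exactly why a stuck `ContainerInspect` without a timeout hangs the whole check: the caller has no way to abandon the blocked call.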
Do you have trouble with general Docker usage on the host? `ps` or `inspect` taking a long time to return?
`docker ps` works fine, the host is only running 20 containers. You are right that `docker inspect` seems to be unresponsive for 2 containers...
@mhaamann so it's definitely a bug on Docker's side, but we'll try to mitigate it with a timeout on our side to avoid having the docker check hang forever.
Anything special about these containers, so we can reproduce & document the issue?
I am guessing that the bug on Docker's side is related to https://github.com/moby/moby/issues/35933. Anyway, a timeout in Datadog seems reasonable.
Datadog not working actually helped us track down this Docker issue a lot better. We had been struggling to find a way to detect it.
I am having a similar issue with collector-queue going unhealthy after varying periods of time. I don't have a Docker check so I doubt this is related to Docker.
If I kill the 2-3 zombie processes I have using `kill [process_id]` (since `docker kill` is not responding), and make sure that all containers respond to `docker inspect`, then the collector-queue works fine for me. At least it has been running for ~30 minutes now.
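Each `docker inspect` probe can itself be bounded with a timeout so a hung container doesn't stall the whole sweep. A sketch using `subprocess` (the helper is generic; the Docker loop in the comment assumes a reachable Docker daemon and is not executed here):

```python
import subprocess
import sys

def responds_within(cmd, timeout):
    """True if `cmd` exits within `timeout` seconds, False if it hangs."""
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False

# Intended usage on a Docker host (not run here):
#   for cid in subprocess.check_output(["docker", "ps", "-q"]).split():
#       if not responds_within(["docker", "inspect", cid.decode()], 5):
#           print("hung container:", cid.decode())

# Demo with a plain subprocess standing in for a hung `docker inspect`:
print(responds_within([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))  # False
```

`subprocess.run` kills the child when the timeout expires, so the probe itself cannot pile up zombie processes.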
@renan maybe you have a check that times out as well somewhere? I suggest you do the same as me and remove one check at a time to find the culprit.
@mhaamann Thanks. I will look into that, especially because I have a custom check that collects quite a few metrics.
Meanwhile, I have looked into the output of `pprof` both while everything was healthy and with the unhealthy component. The only real difference between the two is this:
```
goroutine 117 [syscall, 25 minutes, locked to thread]:
github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python._Cfunc__gopy_PyObject_CallMethod(0x7fdb88992050, 0x7fdb5c001270, 0xc400000001, 0x7fdb5c001290, 0xc420082078, 0x0)
	github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python/_obj/_cgo_gotypes.go:5465 +0x4e
github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python.(*PyObject).CallMethod.func5(0xc420082078, 0x1, 0x1, 0x7fdb88992050, 0x7fdb5c001270, 0x1, 0x7fdb5c001290, 0xc420082078, 0x8)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python/object.go:407 +0x144
github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python.(*PyObject).CallMethod(0xc4204f29a0, 0x1f2f0eb, 0x3, 0xc420815df0, 0x1, 0x1, 0x0)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/vendor/github.com/sbinet/go-python/object.go:402 +0x5dd
github.com/DataDog/datadog-agent/pkg/collector/py.(*PythonCheck).Run(0xc4200138c0, 0x0, 0x0)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/py/check.go:65 +0x21f
github.com/DataDog/datadog-agent/pkg/collector/runner.(*Runner).work(0xc420476b40)
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/runner/runner.go:255 +0x4a1
created by github.com/DataDog/datadog-agent/pkg/collector/runner.(*Runner).AddWorker
	/builds/datadog/datadog-agent/.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/collector/runner/runner.go:100 +0x8a
```
Here's the complete output: https://gist.github.com/renan/d722573c19c1a99150aa7f8451de840c
Hi @renan, the `pprof` output does point to a Python check taking a very long time to run (at the time of the profile, the check had been running its current iteration for the past 25 minutes). I'd recommend adding logging to your custom check to identify where it's hanging or what's taking so long.
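A low-effort way to add that logging is a timing decorator around the suspect methods of the custom check. A sketch (the logger name and the `check` function are hypothetical placeholders, not Agent APIs):

```python
import functools
import logging
import time

log = logging.getLogger("my_custom_check")  # hypothetical logger name

def timed(fn):
    """Log how long a method takes, so slow or hung iterations show up."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            log.warning("%s took %.2fs", fn.__name__, time.monotonic() - start)
    return wrapper

# In the custom check, decorate the slow suspects:
@timed
def check(instance):
    time.sleep(0.1)  # stand-in for the real collection work
```

Decorating each helper the check calls narrows down which step hangs, the same way the goroutine dump narrowed it down for the docker check.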
Alternatively/on top of this, if you'd like the other checks to continue running even when your custom check is stuck, you can increase the number of check runners (default: `1`), for instance to `2`, in `datadog.yaml` (see https://github.com/DataDog/datadog-agent/blob/6.0.3/pkg/config/config_template.yaml#L92-L96).
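For reference, the relevant `datadog.yaml` fragment would look something like this (key name taken from the linked config template; verify it against your Agent version):

```yaml
## Number of concurrent check runners. With more than one runner,
## a single stuck check no longer blocks every other check.
check_runners: 2
```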
Thanks @olivielpeau, that helped. I was able to track it down to StatsD check getting stuck. I created a PR to fix it in https://github.com/DataDog/integrations-core/pull/1363 👍
Closing this issue, it should be fixed now: with the fixes (both introduced in v6.5.0), the Agent no longer becomes unhealthy when a single check is stuck.
I experienced the same issue as reported in https://github.com/DataDog/datadog-agent/issues/1469. I updated the Datadog agent from 5 to 6 via Datadog Ansible and changed the docker conf file from docker_daemon.yaml to docker.yaml, which removed the errors.
But now nothing is displayed in the Datadog Web Application. How do I get the web application to work again?
Output from datadog-agent configcheck
Updated
Changed log_level to DEBUG in datadog.yaml and restarted the datadog-agent. After some seconds the health looks good:
But after approximately a minute the health becomes unhealthy. Running `datadog-agent health` returns:
At the same moment agent.log shows:
Environment information:
- Docker: 17.12.1-ce
- Kernel Version: 4.15.6-1.el7.elrepo.x86_64
- Operating System: CentOS Linux 7 (Core)
Result: no information is displayed in the Datadog Web Interface from the server.