appcelerator-archive / amp

** THIS PROJECT IS STOPPED ** An open source CaaS for Docker, batteries included.
http://appcelerator.io
Apache License 2.0

Telegraf failing to capture container stats #254

Closed (generalhenry closed 8 years ago)

generalhenry commented 8 years ago

Locally (ubuntu@macbook) I'm running into errors such as:

2016/09/26 18:39:15 Error gathering container [/registry.1.5o68h3f201xpqzqhvment4teo] stats: Error getting docker stats: An error occurred trying to connect: context deadline exceeded

And not getting any container stats.

ndegory commented 8 years ago

@generalhenry this means that Docker doesn't reply to the telegraf plugin in time (10 sec by default, which is already a lot). We can set a higher timeout with the INTERVAL env var, but I don't think it's a good workaround. Can you restart your Docker daemon and check whether you still have this issue?

generalhenry commented 8 years ago

docker service rm `docker service ls -q`
docker rm -f `docker ps -aq`
sudo service docker restart
sudo ./swarm start
./swarm monitor   (wait for all 1/1)
docker logs (telegraph container)

Still the same result: context deadline exceeded

ndegory commented 8 years ago

@generalhenry can you please try to reproduce with different settings on the telegraf-agent service?
shorter gathering period: -e INTERVAL=5s (should have the same issue)
longer gathering period: -e INTERVAL=20s -e FLUSH_INTERVAL=22s (should not time out)
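
The two variants suggested here can be applied to a running service with `docker service update --env-add`. The sketch below only builds and prints the commands, it does not run them. The function name `build_update_command` is hypothetical, and the service name `telegraf-agent` is an assumption taken from the wording of the comment; the actual name in a given swarm may differ.

```python
def build_update_command(service, interval, flush_interval=None):
    """Build the argv for `docker service update` that sets the gathering env vars."""
    argv = ["docker", "service", "update", "--env-add", f"INTERVAL={interval}"]
    if flush_interval is not None:
        # FLUSH_INTERVAL is only set for the longer gathering period variant.
        argv += ["--env-add", f"FLUSH_INTERVAL={flush_interval}"]
    argv.append(service)
    return argv


if __name__ == "__main__":
    # Service name is an assumption; check `docker service ls` for the real one.
    print(" ".join(build_update_command("telegraf-agent", "5s")))
    print(" ".join(build_update_command("telegraf-agent", "20s", "22s")))
```

Printing the commands first makes it easy to review them before pasting them into a shell on the swarm manager.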

ndegory commented 8 years ago

I noticed that the timeout happens when there is high activity (typically when the stack is launched: Docker is busy launching services, and you may experience timeouts).

chrisccoy commented 8 years ago

Not that this helps, but on my laptop, which is also running Ubuntu 16.04, there are zero reported errors in the logs. I performed the same restart instructions as @generalhenry and cannot reproduce the issue.

ndegory commented 8 years ago

@generalhenry I don't think anybody was able to reproduce this issue. Do you still experience it with Docker 1.12.2?

generalhenry commented 8 years ago

@ndegory Yes, still the same issue.

generalhenry commented 8 years ago

I also tried extending the timeout:

[[inputs.docker]]
  ## Docker Endpoint
  ##   To use TCP, set endpoint = "tcp://[ip]:[port]"
  ##   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"
  ## Only collect metrics for these containers, collect all if empty
  container_names = []
  ## Docker collection timeout
  timeout = "18s"

generalhenry commented 8 years ago

I tried manually querying the stats, e.g.

curl --unix-socket /var/run/docker.sock http:/containers/registry.1.1wiv2if2xdce96qo79lecwqji/stats

The HTTP connection just hangs.

So the issue isn't telegraf, it's something about my local docker.
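
For anyone who wants to repeat this check without telegraf in the loop, here is a minimal sketch that requests a single stats sample over the Docker Unix socket and gives up after a timeout. It assumes the Docker Engine API `stats` endpoint with the `stream=false` query parameter. The names `UnixHTTPConnection` and `probe_stats` are hypothetical helpers written for this illustration, not part of telegraf or Docker.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout):
        # The host name is only used for the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def probe_stats(container, socket_path="/var/run/docker.sock", timeout=10.0):
    """Return one stats sample as a dict, or None if no reply arrives within `timeout` seconds."""
    conn = UnixHTTPConnection(socket_path, timeout)
    try:
        # stream=false asks the daemon for a single sample rather than a continuous stream.
        conn.request("GET", f"/containers/{container}/stats?stream=false")
        resp = conn.getresponse()
        body = resp.read()
        if resp.status != 200:
            raise RuntimeError(f"HTTP {resp.status}: {body[:200]!r}")
        return json.loads(body)
    except socket.timeout:
        # The daemon accepted the connection but did not answer in time.
        return None
    finally:
        conn.close()
```

A `None` result for a running container, with a timeout comparable to the plugin's, would point at the Docker daemon rather than telegraf, which matches the conclusion above. Connection errors such as a missing socket are left to propagate so they are not mistaken for a slow daemon.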