crosbymichael / skydock

Service discovery via DNS for docker
MIT License

Stops resolving running containers after skydock: EOF under heavy system loads #57

Closed by ubergarm 10 years ago

ubergarm commented 10 years ago

Executive Summary:

When a VPS maxes out its RAM and swap, skydns+skydock may eventually stop resolving running containers. A manual docker restart skydock gets the services back into skydns, and everything resolves happily again after that.

Obvious Solutions

1) Throw more hardware at it!
2) Limit the amount of resources containers can take, to prevent system starvation.
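For the second option, here is a minimal sketch of capping a container's memory at start time. It assumes the -m (memory limit) flag of docker run and a host kernel with the memory cgroup enabled. run_limited is a hypothetical helper name, and any limit you pass is only an example value.

```shell
# Hypothetical helper: start a detached container with a hard memory cap.
# Usage: run_limited LIMIT IMAGE [ARGS...]
run_limited() {
    limit=$1
    shift
    # -m sets the container's memory limit (for example 512m).
    docker run -d -m "$limit" "$@"
}
```

For example, run_limited 512m sameersbn/gitlab:latest would start one of the stress-test containers with a 512 MB cap. Limiting swap as well may need swap accounting enabled in the kernel; that is an assumption about the host, not something tested in this report.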

Preferred Solution

Create a way for the skydock+skydns stack to recover gracefully after being temporarily starved of resources.
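Until the stack can recover by itself, the manual restart can be automated from outside as a stopgap. The sketch below is not the in-process fix asked for here. It assumes the host resolves names through skydns (as configured in resolv.conf further down in this report) and uses getent hosts as the probe. skydock_watchdog and its default arguments are hypothetical.

```shell
# Hypothetical watchdog: restart skydock when a name that should resolve
# no longer does. Meant to be run periodically, e.g. from cron.
# Usage: skydock_watchdog [NAME] [CONTAINER]
skydock_watchdog() {
    name=${1:-redis1.redis.dev.docker}
    container=${2:-skydock}

    # getent hosts exits non-zero when the name cannot be resolved.
    if getent hosts "$name" >/dev/null 2>&1; then
        return 0
    fi

    echo "$(date -u '+%Y/%m/%d %H:%M:%S') $name not resolving, restarting $container" >&2
    docker restart "$container"
}
```

This only covers the case where skydock lost its registrations. It would not help when skydns itself becomes defunct, as happened in the test run described below.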

Production Setup

My production box in this case is a 2x CPU, 4GB RAM, 4GB SWAP, Digital Ocean VPS running an apache+php container, a mysql container, a skydock container, a skydns container, and nginx on the host.

Overview

I noticed problems right at the end of the month (as site usage peaks) when I started getting 500 errors which required a manual docker restart skydock to get skydns resolving properly again.

Correlating the logs and metrics, I saw EOF errors in the skydock logs right when the 500 errors started, while the graphs showed high system load:

[error] 1399064111 skydock: EOF
[error] 1399064111 skydock: EOF
[error] 1399064111 skydock: EOF
[error] 1399064111 skydock: EOF
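Correlating the two logs is easier once both use the same clock format. skydock prints epoch seconds, while skydns prints wall-clock times that appear to be UTC in this report. The sketch below converts skydock lines to the skydns layout. It assumes GNU date (for -d @EPOCH), and skydock_log_to_utc is a hypothetical helper name.

```shell
# Rewrite skydock log lines from stdin so the epoch field becomes a UTC
# timestamp in the layout skydns uses (YYYY/MM/DD HH:MM:SS).
# Assumes GNU date, as shipped with Ubuntu.
skydock_log_to_utc() {
    while read -r level epoch rest; do
        case "$epoch" in
            ''|*[!0-9]*)
                # Not a "[level] EPOCH message" line: pass it through.
                printf '%s\n' "$level${epoch:+ $epoch}${rest:+ $rest}"
                ;;
            *)
                stamp=$(date -u -d "@$epoch" '+%Y/%m/%d %H:%M:%S')
                printf '%s %s %s\n' "$stamp" "$level" "$rest"
                ;;
        esac
    done
}
```

Piping the skydock log through it (for example docker logs skydock 2>&1 | skydock_log_to_utc) turns the EOF lines above into 2014/05/02 20:55:11 [error] skydock: EOF, which lines up with the skydns entries from the same second.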

The best reference I could find was from the Docker IRC chat, where a rather overloaded VPS threw the same errors.

The error seems to come from heartbeat() -> updateService(). So perhaps the connection between skydock and skydns drops, the heartbeats stop getting through, and the services time out?

I was able to repeat it two out of two tries on a fresh VPS test install with the latest docker/skydock/skydns stack. Increasing the TTL from 30 up to 300 made the system stay up longer, but eventually it hung long enough to crap out.

Stress testing provided by starting multiple gitlab containers: thanks Ruby! :)

Details

Steps to reproduce Skydock EOF and loss of registered services from skydns causing failure to resolve container names.

Test Hardware

Digital Ocean Droplet 1x CPU ~512MB RAM ~512MB Swap

Test OS

Ubuntu 12.04.4 LTS (GNU/Linux 3.8.0-29-generic x86_64)
Linux skydock-test 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Docker Version

Installed cgroup-lite and lxc-docker from deb https://get.docker.io/ubuntu docker main

Client version: 0.10.0
Client API version: 1.10
Go version (client): go1.2.1
Git commit (client): dc9c28f
Server version: 0.10.0
Server API version: 1.10
Git commit (server): dc9c28f
Go version (server): go1.2.1
Last stable version: 0.10.0

Docker Daemon OPTS

/etc/init/docker.conf -- DOCKER_OPTS="-dns 172.17.42.1 -bip 172.17.42.1/16"

ulimit -a

ulimit -a as root:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3781
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3781
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Pull Images

$ docker pull crosbymichael/skydns
$ docker pull crosbymichael/skydock
$ docker pull crosbymichael/redis
$ docker pull sameersbn/gitlab:latest

Start Containers

$ docker run -d -p 172.17.42.1:53:53/udp --name skydns crosbymichael/skydns -nameserver 8.8.8.8:53 -domain docker
$ docker run -d -v /var/run/docker.sock:/docker.sock --name skydock crosbymichael/skydock -ttl 10 -environment dev -s /docker.sock -domain docker -name skydns
$ docker run -d --name redis1 crosbymichael/redis
$ docker run -d --name redis2 crosbymichael/redis

Set up the host's resolv.conf

$ sudo sed -i '1s/^/nameserver 172.17.42.1\n/' /etc/resolv.conf

Test Base Config

Everything starts out resolving fine.

$ ping redis.dev.docker    
$ ping redis1.redis.dev.docker    
$ ping redis2.redis.dev.docker    
$ ping skydock.dev.docker    
$ ping skydns.dev.docker    

Introduce Heavy Load

Keep spinning up gitlab containers to thrash the machine until DNS breaks.

$ docker run -d sameersbn/gitlab:latest
...
$ docker run -d sameersbn/gitlab:latest

Skydock logs

[debug] 1399049971 skydock: received event (die) 5226e81f50219a1928f10bd50a9184d54f4b2c2d1e2aaf6503e2ec12c36e0c8f sameersbn/gitlab:latest
[info] 1399049971 skydock: removing 5226e81f50 from skydns
[error] 1399050033 skydock: read tcp 172.17.0.2:8080: connection reset by peer
[error] 1399050034 skydock: EOF
[error] 1399050034 skydock: EOF
[error] 1399050041 skydock: EOF
[debug] 1399050069 skydock: received event (die) 84e06d358662703ddadb7f1835ec7cfce0c317ece05888d625abfc36140b9d90 sameersbn/gitlab:latest
[info] 1399050069 skydock: removing 84e06d3586 from skydns
[debug] 1399050339 skydock: received event (die) 45ad35fad38dad6c56eda5513113fe70dcbff3afa32c61d4a807bdb8924e5f0f sameersbn/gitlab:latest
[info] 1399050339 skydock: removing 45ad35fad3 from skydns

Skydns logs

2014/05/02 16:58:53 Received DNS Request for "redis1.redis.dev.docker." from "172.17.42.1:48430"
2014/05/02 16:58:59 Received DNS Request for "redis1.redis.dev.docker." from "172.17.42.1:56676"
2014/05/02 16:58:59 Calling 0 callback(s) for service 84e06d3586
2014/05/02 16:58:59 Removed Service: 84e06d3586
2014/05/02 16:58:59 Updated Service TTL: 87a190cf50 10
2014/05/02 16:58:59 Updated Service TTL: 4a52804fb8 10
2014/05/02 16:58:59 Updated Service TTL: ccb9859838 10
2014/05/02 16:58:59 Updated Service TTL: 8c65d2828b 10
2014/05/02 16:58:59 Updated Service TTL: 44f871be83 10
2014/05/02 16:58:59 Updated Service TTL: 5226e81f50 10
2014/05/02 16:58:59 Updated Service TTL: 1b16f8703a 10
2014/05/02 16:58:59 Calling 0 callback(s) for service 1b16f8703a
2014/05/02 16:58:59 Removed Service: 1b16f8703a
2014/05/02 16:58:59 Updated Service TTL: 45ad35fad3 10
2014/05/02 16:58:59 Calling 0 callback(s) for service 44f871be83
2014/05/02 16:58:59 Removed Service: 44f871be83
2014/05/02 16:58:59 Calling 0 callback(s) for service 5226e81f50
2014/05/02 16:58:59 Removed Service: 5226e81f50
2014/05/02 16:58:59 Calling 0 callback(s) for service 9233663fd1
2014/05/02 16:58:59 Removed Service: 9233663fd1
2014/05/02 16:59:00 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:44552"
2014/05/02 16:59:01 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 16:59:01 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:47767"
2014/05/02 16:59:01 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 16:59:02 Updated Service TTL: 45ad35fad3 10
2014/05/02 16:59:02 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:55471"

2014/05/02 16:59:02 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 16:59:02 Updated Service TTL: 87a190cf50 10
2014/05/02 16:59:03 Updated Service TTL: 4a52804fb8 10
2014/05/02 16:59:03 Updated Service TTL: ccb9859838 10
2014/05/02 16:59:03 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:59246"
2014/05/02 16:59:03 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 16:59:04 Updated Service TTL: 8c65d2828b 10
2014/05/02 16:59:06 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:41331"
2014/05/02 16:59:06 Error:  Service does not exist in registry
2014/05/02 16:59:06 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:55604"
2014/05/02 16:59:06 Error:  Service does not exist in registry
2014/05/02 16:59:08 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:52937"
2014/05/02 16:59:08 Error:  Service does not exist in registry
2014/05/02 16:59:08 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:47870"
2014/05/02 16:59:08 Error:  Service does not exist in registry
2014/05/02 16:59:10 Updated Service TTL: 45ad35fad3 10
2014/05/02 16:59:10 Updated Service TTL: 87a190cf50 10
2014/05/02 16:59:11 Updated Service TTL: 4a52804fb8 10
2014/05/02 16:59:11 Updated Service TTL: ccb9859838 10
2014/05/02 16:59:12 Updated Service TTL: 8c65d2828b 10
2014/05/02 16:59:14 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:47845"
2014/05/02 16:59:14 Error:  Service does not exist in registry
2014/05/02 16:59:14 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:44293"
2014/05/02 16:59:14 Error:  Service does not exist in registry
2014/05/02 16:59:18 Updated Service TTL: 45ad35fad3 10
2014/05/02 16:59:18 Updated Service TTL: 87a190cf50 10
2014/05/02 16:59:19 Updated Service TTL: ccb9859838 10
2014/05/02 16:59:19 Updated Service TTL: 4a52804fb8 10
2014/05/02 16:59:20 Updated Service TTL: 8c65d2828b 10
2014/05/02 16:59:30 Calling 0 callback(s) for service 4a52804fb8
2014/05/02 16:59:31 Removed Service: 4a52804fb8
2014/05/02 16:59:31 Updated Service TTL: 87a190cf50 10
2014/05/02 16:59:31 Updated Service TTL: ccb9859838 10
2014/05/02 16:59:31 Updated Service TTL: 8c65d2828b 10
2014/05/02 16:59:31 Calling 0 callback(s) for service ccb9859838
2014/05/02 16:59:31 Removed Service: ccb9859838
2014/05/02 16:59:31 Calling 0 callback(s) for service 8c65d2828b
2014/05/02 16:59:31 Removed Service: 8c65d2828b
2014/05/02 16:59:31 Updated Service TTL: 45ad35fad3 10
2014/05/02 16:59:31 Calling 0 callback(s) for service 45ad35fad3
2014/05/02 16:59:31 Removed Service: 45ad35fad3
2014/05/02 16:59:31 Calling 0 callback(s) for service 87a190cf50
2014/05/02 16:59:31 Removed Service: 87a190cf50
2014/05/02 16:59:35 Received DNS Request for "redis2.redis.dev.docker." from "172.17.42.1:55834"
2014/05/02 16:59:35 Error:  Service does not exist in registry
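The stall that lets entries expire shows up as a pause between consecutive log lines. A minimal sketch for finding such pauses in a skydns log follows. It assumes GNU date for parsing the timestamps, and skydns_log_gaps is a hypothetical helper name.

```shell
# Print every pause longer than THRESHOLD seconds between consecutive
# timestamped lines of a skydns log read from stdin.
# Usage: skydns_log_gaps THRESHOLD < skydns.log
skydns_log_gaps() {
    threshold=$1
    prev=''
    prev_stamp=''
    while read -r day clock _rest; do
        # Skip blank lines and anything without a YYYY/MM/DD prefix.
        case "$day" in
            [0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]) ;;
            *) continue ;;
        esac
        # Convert slashes to dashes so GNU date sees an ISO-style date.
        now=$(date -u -d "$(printf '%s' "$day" | tr / -) $clock" +%s) || continue
        if [ -n "$prev" ] && [ $((now - prev)) -gt "$threshold" ]; then
            printf '%s -> %s %s (%ss)\n' "$prev_stamp" "$day" "$clock" "$((now - prev))"
        fi
        prev=$now
        prev_stamp="$day $clock"
    done
}
```

Run over the TTL=10 excerpt above with a threshold of 9, this should report only the pause from 16:59:20 to 16:59:30, right before the remaining services are removed.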

skydns goes zombie

$ docker stop skydns  # nothing happens; after a while, hit Ctrl+C
$ docker kill skydns  # nothing, just hangs

Docker Daemon Logs

2014/05/02 15:41:00 POST /v1.10/containers/87a190cf501c/stop?t=10
[/var/lib/docker|8d7c52cd] +job stop(87a190cf501c)
2014/05/02 15:41:10 Container 87a190cf501cdc2c1bd242be9b20ba56aeb2fa26bc0da481e11a23ab34d3a9a5 failed to exit within 10 seconds of SIGTERM - using the force
[/var/lib/docker|8d7c52cd] +job release_interface(87a190cf501cdc2c1bd242be9b20ba56aeb2fa26bc0da481e11a23ab34d3a9a5)
[/var/lib/docker|8d7c52cd] -job release_interface(87a190cf501cdc2c1bd242be9b20ba56aeb2fa26bc0da481e11a23ab34d3a9a5) = OK (0)
2014/05/02 16:14:48 POST /v1.10/containers/skydns/kill?signal=KILL
[/var/lib/docker|8d7c52cd] +job kill(skydns, KILL)
2014/05/02 16:14:58 Container ccb9859838ab failed to exit within 10 seconds of kill - trying direct SIGKILL

ps aux

root      3020  0.1  0.0      0     0 ?        Zsl  12:23   0:20 [skydns] <defunct>

May be related to the kernel version

curl skydns

root@skydock-test:~# curl -X GET -L 172.17.0.2:8080/skydns/services/*
Service does not exist in registry

Repeat Everything with TTL=300 seconds

Same final result, but it took more load and time before the system hung long enough for problems to occur.

Skydock Log

[debug] 1399064109 skydock: received event (create) 3c0272ebfc8df84d4250b9fbf6a3826e71291b674d46d913ae3fc3930b0e3651 sameersbn/gitlab:latest
[debug] 1399064109 skydock: received event (start) 3c0272ebfc8df84d4250b9fbf6a3826e71291b674d46d913ae3fc3930b0e3651 sameersbn/gitlab:latest
[info] 1399064109 skydock: adding 3c0272ebfc (gitlab) to skydns
[info] 1399064110 skydock: updating ttl for 42482f363d
[info] 1399064110 skydock: updating ttl for a8b74bf97f
[error] 1399064111 skydock: EOF
[error] 1399064111 skydock: EOF
[error] 1399064111 skydock: EOF
[error] 1399064111 skydock: EOF

Skydns Log

2014/05/02 20:52:44 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 20:52:44 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:46400"
2014/05/02 20:52:44 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 20:52:51 Updated Service TTL: 2c1da75ec2 300
2014/05/02 20:52:54 Updated Service TTL: 894c4385c6 300
2014/05/02 20:53:07 Received DNS Request for "daisy.ubuntu.com." from "172.17.42.1:59028"
2014/05/02 20:53:08 Forwarded DNS Request "daisy.ubuntu.com." to "8.8.8.8:53"
2014/05/02 20:53:09 Received DNS Request for "daisy.ubuntu.com." from "172.17.42.1:45508"
2014/05/02 20:53:15 Error: Failure to Forward DNS Request "dial udp 8.8.8.8:53: i/o timeout"
2014/05/02 20:53:16 Received DNS Request for "redis1.redis.dev.docker." from "172.17.42.1:51719"
2014/05/02 20:55:08 Error: Failure to Forward DNS Request "dial udp 8.8.8.8:53: i/o timeout"
2014/05/02 20:55:08 Received DNS Request for "redis1.redis.dev.docker." from "172.17.42.1:45692"
2014/05/02 20:55:08 Received DNS Request for "redis1.redis.dev.docker." from "172.17.42.1:44059"
2014/05/02 20:55:09 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:39885"
2014/05/02 20:55:09 Received DNS Request for "daisy.ubuntu.com." from "172.17.42.1:38250"
2014/05/02 20:55:09 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 20:55:09 Forwarded DNS Request "daisy.ubuntu.com." to "8.8.8.8:53"
2014/05/02 20:55:09 Updated Service TTL: a978419b17 300
2014/05/02 20:55:09 Updated Service TTL: c20c19453e 300
2014/05/02 20:55:09 Updated Service TTL: fdda8dd6f2 300
2014/05/02 20:55:09 Updated Service TTL: 206af0ea2f 300
2014/05/02 20:55:09 Received DNS Request for "daisy.ubuntu.com." from "172.17.42.1:56460"
2014/05/02 20:55:09 Calling 0 callback(s) for service fdda8dd6f2
2014/05/02 20:55:09 Removed Service: fdda8dd6f2
2014/05/02 20:55:09 Added Service: {3c0272ebfc gitlab clever_fermi dev  172.17.0.7 22 300 2014-05-02 21:00:09.667536585 +0000 UTC map[]}
2014/05/02 20:55:09 Updated Service TTL: 8c1a3c129c 300
2014/05/02 20:55:09 Updated Service TTL: 198847fca4 300
2014/05/02 20:55:09 Updated Service TTL: 4b10f487ac 300
2014/05/02 20:55:09 Updated Service TTL: ca82643f15 300
2014/05/02 20:55:09 Updated Service TTL: b7dda2baa5 300
2014/05/02 20:55:09 Updated Service TTL: 4b59b2c0d3 300
2014/05/02 20:55:09 Updated Service TTL: 1c05e58b3c 300
2014/05/02 20:55:09 Updated Service TTL: 8b15f1d0b2 300
2014/05/02 20:55:09 Updated Service TTL: a7ae9c16b2 300
2014/05/02 20:55:09 Calling 0 callback(s) for service 2c1da75ec2
2014/05/02 20:55:09 Removed Service: 2c1da75ec2
2014/05/02 20:55:09 Forwarded DNS Request "daisy.ubuntu.com." to "8.8.8.8:53"
2014/05/02 20:55:09 Added Service: {de17559007 gitlab jolly_engelbart dev  172.17.0.21 22 300 2014-05-02 21:00:09.672444668 +0000 UTC map[]}
2014/05/02 20:55:09 Updated Service TTL: c0566e3579 300
2014/05/02 20:55:09 Calling 0 callback(s) for service 206af0ea2f
2014/05/02 20:55:09 Removed Service: 206af0ea2f
2014/05/02 20:55:09 Calling 0 callback(s) for service a978419b17
2014/05/02 20:55:09 Removed Service: a978419b17
2014/05/02 20:55:09 Calling 0 callback(s) for service c20c19453e
2014/05/02 20:55:09 Removed Service: c20c19453e
2014/05/02 20:55:09 Calling 0 callback(s) for service a7ae9c16b2
2014/05/02 20:55:09 Removed Service: a7ae9c16b2
2014/05/02 20:55:09 Calling 0 callback(s) for service c0566e3579
2014/05/02 20:55:09 Removed Service: c0566e3579
2014/05/02 20:55:09 Calling 0 callback(s) for service 8c1a3c129c
2014/05/02 20:55:09 Removed Service: 8c1a3c129c
2014/05/02 20:55:10 Received DNS Request for "daisy.ubuntu.com." from "172.17.42.1:57337"
2014/05/02 20:55:10 Received DNS Request for "4.0.17.172.in-addr.arpa." from "172.17.42.1:51288"
2014/05/02 20:55:10 Forwarded DNS Request "daisy.ubuntu.com." to "8.8.8.8:53"
2014/05/02 20:55:10 Forwarded DNS Request "4.0.17.172.in-addr.arpa." to "8.8.8.8:53"
2014/05/02 20:55:10 Updated Service TTL: 42482f363d 300
2014/05/02 20:55:10 Updated Service TTL: a8b74bf97f 300

ubergarm commented 10 years ago

I attempted to address this in #58

ubergarm commented 10 years ago

Just noticed the 'no-expire' branch and notes at skynetservices/skydns#84 which should just get rid of the whole issue of tweaking timers and error counters. As long as skydock is always running and catches all the notifications from the docker API then skydns should stay in sync.

crosbymichael commented 10 years ago

Yes, skydns is also in the middle of a rewrite to use etcd as the backend so it should be more robust and I will not add any TTL with skydock so that should solve your issue. I just don't know if I should ship skydns, skydock, and etcd in 1 or 3 containers.

ubergarm commented 10 years ago

Thanks for the update. Multi-host will be exciting!

One idea would be to use a single image with all three binaries in it. Then you would instantiate three containers using different names and command line arguments for each entry point.

Pros for this would be: 1) Logs kept separate and not interwoven 2) No need for an inside container process manager watching and restarting on failures. 3) The three binaries' versions could be tested together before vendored out in one package. 4) Possible to use an existing general purpose etcd server possibly already running on the host.

Cons: 1) You still have to start up and monitor 3 separate containers 2) Potential difficulty linking up the first three containers before discovery services exist.

Kind of like how one jar/tar.gz vendors logstash, elasticsearch, and kibana functionality?

Just some thoughts.
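A rough sketch of the single-image idea: one image, three containers, each started with a different entry point. The image name and the /usr/local/bin paths are placeholders rather than a real published layout, and each service's own command-line arguments (and skydock's docker.sock mount) are left out.

```shell
# Hypothetical launcher: one image, three containers, one per binary.
# Usage: start_sky_stack IMAGE
start_sky_stack() {
    image=$1
    for svc in etcd skydns skydock; do
        # --entrypoint overrides the image's default entry point, so the
        # same image can run a different binary in each container.
        docker run -d --name "$svc" --entrypoint "/usr/local/bin/$svc" "$image" || return 1
    done
}
```

Starting etcd first matters only if the other two need it at boot. The ordering here is a guess about the planned rewrite, not something confirmed in this thread.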

arnos commented 10 years ago

I was tinkering with how to do deployments like these. One idea (based on shipyard) is to have a setup container that installs all the separate containers you need; this would also remove any difficulty in linking the initial containers.

You could take that idea one step further by generalizing it with a web-based UI for the configuration (e.g. an MSI for Docker).

For the logs, would it be possible to extend the concept with a fourth container whose mounted drive holds the logs of all the containers, so that one might use something like logstash-forwarder to go through them? See https://denibertovic.com/post/docker-and-logstash-smarter-log-management-for-your-containers/

For starting up and monitoring, between shipyard, Project Atomic and perhaps Flynn, there are more and more options out there.


pdwinkel commented 10 years ago

I would prefer separate images, so you can use the images of skydns(2) and skydock in combination with CoreOS.

bketelsen commented 10 years ago

+1 for separate images.

ubergarm commented 10 years ago

Regardless of the image question, the original issue was taken care of in the skydock no-expire branch. I built a single image to vendor a solution, which takes the form of two running containers (start them with whatever method you want), here:

https://index.docker.io/u/ubergarm/skyservices/

Hopefully this is agnostic enough to be useful in general, but it works for me and my runit based production system.

I look forward to seeing the new skydns2 and skydock stuff when it is ready! Thanks!