Upgrading Docker & docker-compose didn't resolve this, but it did give me some more useful diagnostic messages. Will continue exploring options:
==============================
2015-12-12 00:42:21 Starting containers ...
==============================
Creating blimp_conduit_1
Creating blimp_doveshed_1
Creating blimp_mailpilekermit_1
Creating blimp_musterroll_1
Creating blimp_radicale_1
Creating blimp_mailbox_1
Creating blimp_cockpit_1
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
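For reference, the one-off diagnostic run that this error message suggests would look roughly like this (the 120-second value is just an example, not something we ship):
COMPOSE_HTTP_TIMEOUT=120 docker-compose --verbose up -d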
It seems to be a known error when docker-compose is trying to start many containers at the same time. Seems to be a bug in Docker itself. No suggested solution.
Timeouts are not an issue with the client (Compose). All we can do from compose is provide a better error message (which we've done in 1.5.0).
Can we start the containers "manually" via docker in a known pre-order? Or do we need docker-compose to actually wire the containers together?
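If we did try the manual route, a rough sketch could look like the loop below, using the container names from the log above. Note this only starts containers that Compose has already created; the links between them would still have to be wired up by hand (with docker run --link at creation time), which is exactly what docker-compose does for us.
# Hypothetical manual startup in a fixed order, bypassing docker-compose.
# This assumes the blimp_* containers already exist (docker create / docker run).
for c in blimp_conduit_1 blimp_doveshed_1 blimp_mailpilekermit_1 \
         blimp_musterroll_1 blimp_radicale_1 blimp_mailbox_1 blimp_cockpit_1; do
    docker start "$c" || echo "failed to start $c" >&2
done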
I'm still a little confused on where docker/docker-compose are "coming from", and need to check the versions that I have deployed on my Blimps. Are we using the Go-based implementations?
If we are using Go, I wonder if underlying problems https://github.com/golang/go/issues/12233 could be causing our problems.
Just to report that since the registry.marina.io "update" of 03-DEC-2015, my Blimp has run start-containers.sh without the problems that @metakermit has reported here.
We could start things manually, but hooking up containers to each other would also have to be done manually then and a bunch of scripts have to be modified / new ones written. Also, no guarantee that it would work then. Some people on the issues I linked tried increasing the HTTP timeouts to >200s without success. It seems to be an underlying Docker issue where it sometimes just hangs. Open issue unfortunately.
I can reproduce it on the RPi2 running Docker 1.8.2 (the version we install from our Ansible scripts), with Docker 1.9.1 that I tried manually on the RPi2, and even with the older Docker 1.6.2 that I still have on my Cubox.
Docker-compose is still written in Python (I probably misremembered something about a rewrite in Go), but it doesn't matter, as the Compose devs say the issue is not on their side. We install it as a dependency of meta-compose right now.
Docker was and still is written in Go. I don't know if the issue you linked is related. I tried raising GOMAXPROCS (the maximum number of OS threads the Go runtime will use) as some people suggested, but that didn't help:
export GOMAXPROCS=4
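Exporting it in a shell only affects that shell, though; to apply it to the daemon itself it would have to go into the init script's environment. A sketch, assuming the stock Debian packaging where the Docker 1.x init script sources /etc/default/docker:
# Make the setting visible to the Docker daemon, then restart it.
echo 'export GOMAXPROCS=4' >> /etc/default/docker
service docker restart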
We can try changing certain parameters & configurations, in case it's after all something we caused. One thing that strikes me as odd is that it started happening simultaneously on both of my Blimps (Cubox & RPi) around Dec 7th. Maybe it's some image update (though I don't think we've changed any of them – I haven't pushed any changes to the blimp-* repositories to master in weeks). Maybe it's something related to btrfs or the encrypted partition... Perhaps switching to unencrypted ext4 for /var/lib/docker would solve this. One or two people on the linked issues mentioned they are using btrfs, but the problem wasn't happening exclusively on that filesystem. This is my df output on both Blimps. Mark, is your /var/lib/docker mounted on /dev/mapper/cf-str for sure?
Cubox:
root@kermit-blimp:~# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 7571696 5009520 2231692 70% /
devtmpfs 122444 0 122444 0% /dev
tmpfs 131072 384 130688 1% /run
tmpfs 5120 0 5120 0% /run/lock
tmpfs 131072 0 131072 0% /run/shm
tmpfs 1048576 1936 1046640 1% /tmp
ramlog-tmpfs 524288 72816 451472 14% /var/log
cgroup 253684 0 253684 0% /sys/fs/cgroup
/dev/sda1 30508996 44996 28891188 1% /mnt/storage-key
/dev/mapper/cf-str 29029376 2092788 25188268 8% /mnt/storage
/dev/mapper/cf-str 29029376 2092788 25188268 8% /var/lib/docker
/dev/mapper/cf-str 29029376 2092788 25188268 8% /opt/cloudfleet/data
RPi2:
root@kermit-blimp-rpi:~# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 2907792 1235968 1504400 46% /
devtmpfs 469380 0 469380 0% /dev
tmpfs 473712 0 473712 0% /dev/shm
tmpfs 473712 640 473072 1% /run
tmpfs 5120 0 5120 0% /run/lock
tmpfs 473712 0 473712 0% /sys/fs/cgroup
/dev/mmcblk0p1 123642 10676 112966 9% /boot/firmware
/dev/sdb1 1881220 2860 1764748 1% /mnt/storage-key
/dev/mapper/cf-str 29029376 2149304 25181320 8% /mnt/storage
/dev/mapper/cf-str 29029376 2149304 25181320 8% /var/lib/docker
/dev/mapper/cf-str 29029376 2149304 25181320 8% /opt/cloudfleet/data
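If the full df listing is too noisy, checking just that one mount point answers the question directly (assuming util-linux's findmnt is installed; a plain df /var/lib/docker works as well):
# Show only the filesystem backing /var/lib/docker:
findmnt --target /var/lib/docker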
@metakermit asks: "Mark, is your /var/lib/docker mounted on /dev/mapper/cf-str for sure?" Not currently, and not until I can get past https://github.com/cloudfleet/blimp-engineroom/issues/2 locally. Right now I am sprinting to get the minimal blimp-parachute implementation stood up, so I need to do that first.
Woohoo! Managed to get this thing rocking. Prolonging the COMPOSE_HTTP_TIMEOUT to two minutes seems to do the trick in our case. Also, even without this, if timeouts occur, it's possible to just hit Docker again and again with another docker-compose up, and after a certain number of retries it eventually gets all the containers up. I added some retry logic in start-containers.sh in case this ever happens on somebody's Blimp despite the longer timeout.
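The retry logic boils down to something like this sketch (the real version lives in start-containers.sh; the retry count and sleep interval here are illustrative):
# Give the Compose client two minutes per HTTP request instead of the default 60s.
export COMPOSE_HTTP_TIMEOUT=120
# Retry docker-compose up a few times; containers that already came up are
# left running, so each retry only has to bring up the stragglers.
for attempt in 1 2 3 4 5; do
    docker-compose up -d && break
    echo "docker-compose up timed out (attempt $attempt/5), retrying ..." >&2
    sleep 10
done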
So essentially 8a6dcb67a7dc028d94ab5abfaf4badb942ba02e6 closes this for now. I'll reopen it in case I notice similar behaviour in the future.
Another thing I did manually was restart irqbalance which was eating up half of my RAM due to some sort of memory leak. Probably didn't have anything to do with my results, because timeouts were happening even after I did this, but still thought I'd note this.
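(The restart itself was nothing special; on our Debian-based images it's just the stock service invocation:)
service irqbalance restart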
ok
It seems the docker rm and docker start commands are having problems (invoked by docker-compose in the start-containers.sh script). Occurred multiple times. Stack traces for the offenders (the other commands work fine):

In /opt/cloudfleet/data/logs/blimp-upgrade.log:
Some of the containers start successfully: