docker-compose errors started appearing

metakermit commented 8 years ago

It seems docker rm and docker start commands are having problems (invoked by docker-compose in the start-containers.sh script). Occurred multiple times. Stack traces for the offenders (the other commands work fine):

In /opt/cloudfleet/data/logs/blimp-upgrade.log:

==================================
2015-12-10 05:52:14  Deleting containers.
==================================
Going to remove blimp_nginx_1, blimp_wellknown_1, blimp_cockpit_1, blimp_mailbox_1, blimp_radicale_1, blimp_musterroll_1, blimp_mailpilekermit_1, blimp_doveshed_1, blimp_conduit_1
Removing blimp_nginx_1... done
Removing blimp_wellknown_1... done
Removing blimp_cockpit_1... done
Removing blimp_mailbox_1... done
Removing blimp_radicale_1...
Removing blimp_musterroll_1...
Removing blimp_mailpilekermit_1...
Removing blimp_doveshed_1...
Removing blimp_conduit_1...
Traceback (most recent call last):
  File "/usr/local/bin/docker-compose", line 9, in <module>
    load_entry_point('docker-compose==1.4.1', 'console_scripts', 'docker-compose')()
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/main.py", line 39, in main
    command.sys_dispatch()
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/docopt_command.py", line 21, in sys_dispatch
    self.dispatch(sys.argv[1:], None)
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/command.py", line 27, in dispatch
    super(Command, self).dispatch(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/docopt_command.py", line 24, in dispatch
    self.perform_command(*self.parse(argv, global_options))
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/command.py", line 59, in perform_command
    handler(project, command_options)
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/main.py", line 268, in rm
    v=options.get('-v', False)
  File "/usr/local/lib/python2.7/dist-packages/compose/project.py", line 232, in remove_stopped
    msg="Removing"
  File "/usr/local/lib/python2.7/dist-packages/compose/utils.py", line 57, in parallel_execute
    raise errors[msg_index]
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

==============================
2015-12-10 05:54:20 Starting containers ...
==============================
Creating blimp_conduit_1...
Creating blimp_doveshed_1...
Creating blimp_mailpilekermit_1...
Creating blimp_musterroll_1...
Creating blimp_radicale_1...
Creating blimp_mailbox_1...
Creating blimp_cockpit_1...
Traceback (most recent call last):
  File "/usr/local/bin/docker-compose", line 9, in <module>
    load_entry_point('docker-compose==1.4.1', 'console_scripts', 'docker-compose')()
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/main.py", line 39, in main
    command.sys_dispatch()
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/docopt_command.py", line 21, in sys_dispatch
    self.dispatch(sys.argv[1:], None)
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/command.py", line 27, in dispatch
    super(Command, self).dispatch(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/docopt_command.py", line 24, in dispatch
    self.perform_command(*self.parse(argv, global_options))
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/command.py", line 59, in perform_command
    handler(project, command_options)
  File "/usr/local/lib/python2.7/dist-packages/compose/cli/main.py", line 495, in up
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/compose/project.py", line 274, in up
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/compose/service.py", line 377, in execute_convergence_plan
    do_build=do_build,
  File "/usr/local/lib/python2.7/dist-packages/compose/service.py", line 284, in create_container
    return Container.create(self.client, **container_options)
  File "/usr/local/lib/python2.7/dist-packages/compose/container.py", line 42, in create
    response = client.create_container(**options)
  File "/usr/lib/python2.7/dist-packages/docker/client.py", line 245, in create_container
    return self.create_container_from_config(config, name)
  File "/usr/lib/python2.7/dist-packages/docker/client.py", line 252, in create_container_from_config
    res = self._post_json(u, data=config, params=params)
  File "/usr/lib/python2.7/dist-packages/docker/clientbase.py", line 128, in _post_json
    return self._post(url, data=json.dumps(data2), **kwargs)
  File "/usr/lib/python2.7/dist-packages/docker/clientbase.py", line 83, in _post
    return self.post(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 433, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

Some of the containers start successfully:

09fc421a06cb        registry.marina.io/cloudfleet/blimp-mailbox      "/bin/sh -c 'python /"   7 hours ago         Up 7 hours          3000/tcp            blimp_mailbox_1
eedc727b1b6c        registry.marina.io/cloudfleet/blimp-radicale     "/bin/sh -c scripts/s"   7 hours ago         Up 7 hours          80/tcp              blimp_radicale_1
b34dc9e3b034        registry.marina.io/cloudfleet/blimp-musterroll   "npm start"              7 hours ago         Up 7 hours          80/tcp, 389/tcp     blimp_musterroll_1
0dae31107519        registry.marina.io/cloudfleet/blimp-mailpile     "/bin/sh -c /opt/clou"   7 hours ago         Up 7 hours          33411/tcp           blimp_mailpilekermit_1
d4f7778c18a5        registry.marina.io/cloudfleet/blimp-doveshed     "/bin/sh -c ./start.s"   7 hours ago         Up 7 hours          25/tcp              blimp_doveshed_1
76347a070656        registry.marina.io/cloudfleet/blimp-conduit      "/bin/sh -c '/usr/bin"   7 hours ago         Up 7 hours          5000/tcp            blimp_conduit_1

metakermit commented 8 years ago

Upgrading Docker & docker-compose didn't resolve this, but it did give me some more useful diagnostic messages. Will continue exploring options:

==============================
2015-12-12 00:42:21 Starting containers ...
==============================
Creating blimp_conduit_1
Creating blimp_doveshed_1
Creating blimp_mailpilekermit_1
Creating blimp_musterroll_1
Creating blimp_radicale_1
Creating blimp_mailbox_1
Creating blimp_cockpit_1
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

metakermit commented 8 years ago

It seems to be a known error when docker-compose is trying to start many containers at the same time. Seems to be a bug in Docker itself. No suggested solution.

Timeouts are not an issue with the client (Compose). All we can do from compose is provide a better error message (which we've done in 1.5.0).

easye commented 8 years ago

Can we start the containers "manually" via docker in a known pre-order? Or do we need docker-compose to actually wire the containers together?

easye commented 8 years ago

I'm still a little confused on where docker/docker-compose are "coming from", and need to check the versions that I have deployed on my Blimps. Are we using the Go-based implementations?

If we are using Go, I wonder if underlying problems https://github.com/golang/go/issues/12233 could be causing our problems.

easye commented 8 years ago

Just to report that since the registry.marina.io "update" of 03-DEC-2015, my Blimp has run start-containers.sh without the problems that @metakermit has reported here.

metakermit commented 8 years ago

We could start things manually, but hooking up containers to each other would also have to be done manually then and a bunch of scripts have to be modified / new ones written. Also, no guarantee that it would work then. Some people on the issues I linked tried increasing the HTTP timeouts to >200s without success. It seems to be an underlying Docker issue where it sometimes just hangs. Open issue unfortunately.

I can reproduce it on RPi2 running Docker 1.8.2 we install from our Ansible scripts, Docker version 1.9.1 I tried manually on the RPi2 and even using the older Docker 1.6.2 I still have on my Cubox.

Docker-compose is still written in Python (I probably remembered something wrong about this rewrite in Go), but it doesn't matter, as the compose devs say that the issue is not up to them. We install it as a dependency of meta-compose right now.

Docker was and still is written in Go. Don't know if the issue you linked is related. I tried changing the maximum number of Go processes Docker can start as some people suggested, but that didn't help.

export GOMAXPROCS=4

We can try changing out certain parameters & configurations, hoping it's after all something we cuased. One thing that strikes me as odd is that it started happening simultaneously on both of my blimps (Cubox & RPi) around Dec 7th. Maybe it's some image update (but I think we haven't changed any of them – I haven't pushed any changes to the blimp-* repositories to master in weeks). Maybe it's something related to btrfs or the encrypted partition... Perhaps switching to nonencrypted ext4 for /var/lib/docker would solve this. One or two people on the linked issues mentioned they are using btrfs, but it wasn't happening exclusively on these filesystems. This is my df output on both Blimps. Mark, is your /var/lib/docker mounted on /dev/mapper/cf-str for sure?

Cubox:

root@kermit-blimp:~# df
Filesystem         1K-blocks    Used Available Use% Mounted on
/dev/root            7571696 5009520   2231692  70% /
devtmpfs              122444       0    122444   0% /dev
tmpfs                 131072     384    130688   1% /run
tmpfs                   5120       0      5120   0% /run/lock
tmpfs                 131072       0    131072   0% /run/shm
tmpfs                1048576    1936   1046640   1% /tmp
ramlog-tmpfs          524288   72816    451472  14% /var/log
cgroup                253684       0    253684   0% /sys/fs/cgroup
/dev/sda1           30508996   44996  28891188   1% /mnt/storage-key
/dev/mapper/cf-str  29029376 2092788  25188268   8% /mnt/storage
/dev/mapper/cf-str  29029376 2092788  25188268   8% /var/lib/docker
/dev/mapper/cf-str  29029376 2092788  25188268   8% /opt/cloudfleet/data

RPi2:

root@kermit-blimp-rpi:~# df
Filesystem         1K-blocks    Used Available Use% Mounted on
/dev/root            2907792 1235968   1504400  46% /
devtmpfs              469380       0    469380   0% /dev
tmpfs                 473712       0    473712   0% /dev/shm
tmpfs                 473712     640    473072   1% /run
tmpfs                   5120       0      5120   0% /run/lock
tmpfs                 473712       0    473712   0% /sys/fs/cgroup
/dev/mmcblk0p1        123642   10676    112966   9% /boot/firmware
/dev/sdb1            1881220    2860   1764748   1% /mnt/storage-key
/dev/mapper/cf-str  29029376 2149304  25181320   8% /mnt/storage
/dev/mapper/cf-str  29029376 2149304  25181320   8% /var/lib/docker
/dev/mapper/cf-str  29029376 2149304  25181320   8% /opt/cloudfleet/data

easye commented 8 years ago

@metakermit asks Mark, is your /var/lib/docker mounted on /dev/mapper/cf-str for sure? Not currently, until I can get past the https://github.com/cloudfleet/blimp-engineroom/issues/2 locally. Right now I am sprinting to get the minimal blimp-parachute implementation stood up, so I need to do that first.

metakermit commented 8 years ago

Woohoo! Managed to get this thing rocking. Prolonging the COMPOSE_HTTP_TIMEOUT to two minutes seems to do the trick in our case. Also, even without this, if timeouts occur, it's possible to just hit Docker again and again with another docker-compose up and after a certain number of retries it eventually gets all the containers up. I added some retry logic in start-containers.sh in case this ever happens on somebody's Blimp despite the longer timeout.

So essentialy 8a6dcb67a7dc028d94ab5abfaf4badb942ba02e6 closes this for now. I'll reopen it in case I notice similar behaviour in the future.

Another thing I did manually was restart irqbalance which was eating up half of my RAM due to some sort of memory leak. Probably didn't have anything to do with my results, because timeouts were happening even after I did this, but still thought I'd note this.

dengbin283 commented 7 years ago

ok

cloudfleet / blimp-engineroom

docker-compose errors started appearing #3