Henkis opened this issue 10 years ago
A little more from the log:
Wed, 27 Aug 2014 06:03:14 GMT
/docker-registry/prod/images/6e66087f3ffe002664507d225d07b6929843c3f0299f5335a70c1727c8833737/layer
2014-08-27 06:03:14,189 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:7gRFO3tyNcihZCHvmLPGn9WWVFI=
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
boto_key.get_contents_to_file(f, headers={'Range': brange})
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
response_headers=response_headers)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
query_args=None)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
for bytes in self:
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
data = self.resp.read(self.BufferSize)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
self._wait(self._read_event)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
self.hub.wait(watcher)
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
result = waiter.get()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
return self.hub.switch()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f8bbb7269b0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f8bbb2be550>>('/tmp/tmpPPltfO', 1, 13495293, 26990585)> failed with timeout
After a few more pulls it stops responding and nothing is written to the container log.
Client:
2014/08/27 15:53:22 Error pulling image (97a9cc4076fd69d8a35be028a74f0e00b8066555) from 256.31.14.183/server, Failed to download json: Get http://256.31.14.183/v1/images/31f7549bbb3e5ffba93ed7a37ae6cc86fafe92cfd343eca00a788a39c1e57023/json: read tcp 256.31.14.183:80: i/o timeout
root@ip-172-31-7-6:~# ./test.sh
2014/08/27 16:07:14 Error: Invalid Registry endpoint: Get http://256.31.14.183/v1/_ping: read tcp 256.31.14.183:80: i/o timeout
This tells me you are having issues reaching your S3 bucket. Can you look into details on that front? AWS region?
@shin- what do you think?
If it's S3 related I will deploy a boto.cfg with the region set and debug enabled. Initially a restart of the registry container helped, so I did not blame S3 connectivity, but I'm not entirely sure any longer.
It's currently not possible to set S3 region in the registry config, correct?
Right - it's not possible if you are running it inside a docker container (possibly a gevent version issue). Keep me posted.
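For reference, boto itself can be pointed at a regional endpoint, with verbose logging enabled, through its own config file. A minimal boto.cfg sketch, assuming boto's standard [Boto] and [s3] sections (verify the keys against your boto version):

[Boto]
debug = 2
num_retries = 3

[s3]
host = s3-eu-west-1.amazonaws.com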
I've hard-wired boto against eu-west-1 but I am still seeing the same issues. I noticed in the log today that worker threads reported problems inside the container:
2014-09-01 20:27:14 [108] [WARNING] Worker graceful timeout (pid:108)
2014-09-01 20:27:14,172 WARNING: Worker graceful timeout (pid:108)
2014-09-01 20:27:15 [108] [INFO] Worker exiting (pid: 108)
If it's S3 related, any ideas on how to proceed with debugging? I've seen quite a few 404s in the logs, but they are always for an _inprogress file; is that expected?
Another error with current debug:
Client:
b848adeb3e09: Error downloading dependent layers
2014/09/01 22:58:37 Error pulling image (fb469e5e3584041a51bdd35175776033a829e6b6) from 172.31.9.215/adserver, flate: corrupt input before offset 13576377
Server:
Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,804 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,822 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: Method: HEAD
2014-09-01 20:58:25,822 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: Data:
2014-09-01 20:58:25,822 DEBUG: Headers: {}
2014-09-01 20:58:25,822 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,823 DEBUG: Port: 80
2014-09-01 20:58:25,823 DEBUG: Params: {}
2014-09-01 20:58:25,823 DEBUG: Token: None
2014-09-01 20:58:25,823 DEBUG: StringToSign:
HEAD
Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,823 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,834 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: Method: HEAD
2014-09-01 20:58:25,834 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: Data:
2014-09-01 20:58:25,834 DEBUG: Headers: {'Range': 'bytes=13561477-67820546'}
2014-09-01 20:58:25,834 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,834 DEBUG: Port: 80
2014-09-01 20:58:25,834 DEBUG: Params: {}
2014-09-01 20:58:25,835 DEBUG: Token: None
2014-09-01 20:58:25,835 DEBUG: StringToSign:
HEAD
Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,835 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,854 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,854 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,854 DEBUG: Method: GET
2014-09-01 20:58:25,854 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,855 DEBUG: Data:
2014-09-01 20:58:25,855 DEBUG: Headers: {}
2014-09-01 20:58:25,855 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,855 DEBUG: Port: 80
2014-09-01 20:58:25,855 DEBUG: Params: {}
2014-09-01 20:58:25,856 DEBUG: Token: None
2014-09-01 20:58:25,856 DEBUG: StringToSign:
GET
Mon, 01 Sep 2014 20:58:25 GMT
/secret-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,856 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:QaHXpfcR+J4yOoAEKiLTHNqW28Y=
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
boto_key.get_contents_to_file(f, headers={'Range': brange})
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
response_headers=response_headers)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
query_args=None)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
for bytes in self:
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
data = self.resp.read(self.BufferSize)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
self._wait(self._read_event)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
self.hub.wait(watcher)
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
result = waiter.get()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
return self.hub.switch()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f66e6898910: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f66e642b990>>('/tmp/tmp36R_mv', 1, 13564110, 27128219)> failed with timeout
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
boto_key.get_contents_to_file(f, headers={'Range': brange})
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
response_headers=response_headers)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
query_args=None)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
for bytes in self:
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
data = self.resp.read(self.BufferSize)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
self._wait(self._read_event)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
self.hub.wait(watcher)
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
result = waiter.get()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
return self.hub.switch()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f66e68987d0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f66e642b990>>('/tmp/tmp36R_mv', 4, 54256440, 67820545)> failed with timeout
@Henkis the 404s are irrelevant.
What matters here is the timeouts reaching your bucket.
Unfortunately, I have little help to offer as far as debugging is concerned. I would start with a Python script from scratch using boto S3 (https://github.com/boto/boto), query the bucket repeatedly until the issue triggers, then dig into boto.
Keep me posted on any progress on this.
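If you go that route, a minimal probe sketch along those lines (credentials, bucket, and key are placeholders):

#!/usr/bin/python
# Hypothetical probe: fetch the same layer key repeatedly and time each
# attempt, so a stalled or reset connection shows up as an outlier.
import time
from boto.s3.connection import S3Connection

conn = S3Connection('<access-key>', '<secret-key>')
key = conn.get_bucket('<bucket-name>').get_key('<path/to/layer>')

for attempt in range(20):
    start = time.time()
    try:
        key.get_contents_to_filename('/tmp/layer_probe')
        print('attempt %2d: ok after %5.1fs' % (attempt, time.time() - start))
    except Exception as exc:
        print('attempt %2d: FAILED after %5.1fs: %r'
              % (attempt, time.time() - start, exc))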
@Henkis can you try bumping your boto timeout values:
[Boto]
http_socket_timeout = 60
inside boto.cfg
... and report here if that helps?
Thanks.
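If the registry runs in a container, the file also has to end up where boto looks for it (/etc/boto.cfg or ~/.boto by default). A hypothetical Dockerfile fragment to bake it in, assuming you build on the stock registry image:

FROM registry:0.8.0
ADD boto.cfg /etc/boto.cfg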
I have tried adding the socket timeout to my boto.cfg; it doesn't help.
Same problems; maybe some more debug output:
[debug] http.go:162 http://55.67.195.101/v1/images/b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a/json -- HEADERS: map[User-Agent:[docker/1.2.0 go/go1.3.1 git-c94034bf8411f: Download complete
[debug] image.go:325 Json string: {{"id":"b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a","parent":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","created":"2014-08-26T19:40:30.559806204Z","container":"9328ff054fc27a5163c1ac44bfcb026380b9b15: Pulling fs layer
ainname":"","User":"","Memory":0,"MemorySwap":0,"CpuShares":0,"AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"PortSpecs":null,"ExposedPorts":{"443/tcp":{},"80/tcp"b026380b9b15: Downloading 40.68 MB/67.82 MB 1s
b026380b9b15: Error pulling image (59c2ef25615aee308a641a79d79547445b6a2831) from 55.67.195.101/server, flate: corrupt input before offset 40690609 orrupt input before offset 40690609 7] -job pull(55.67.195.101/server, 59c2ef25615aee308a641a79d79547445b6a2831) = ERR (1)le install --gemfile /home/app/webapp/Gemfile --path /home/app/bundle --deploymentb026380b9b15: Error downloading dependent layers
assets:precompile; cp config/database.production.yml config/database.yml"],"Dns":null,"Image":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","Volumes":null,"VolumesFrom":"","WorkingDir":"","Entrypoint":null,"NetworkDisabled":false,"OnBuild":[]},"docker_version":"0.9.1","config":{"Hostname":"95699cec7d3e","Domainname":"","User":"","Memory":0,"MemorySwap":0,"CpuShares":0,"AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"PortSpecs":null,"ExposedPorts":{"443/tcp":{},"80/tcp":{}},"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["HOME=/root","PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["/sbin/my_init"],"Dns":null,"Image":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","Volumes":null,"VolumesFrom":"","WorkingDir":"","Entrypoint":null,"NetworkDisabled":false,"OnBuild":[]},"architecture":"amd64","os":"linux","Size":69388467}}
[debug] http.go:162 http://55.67.195.101/v1/images/b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a/layer -- HEADERS: map[User-Agent:[docker/1.2.0 go/go1.3.1 git-commit/fa7b24f kernel/3.13.0-24-generic os/linux arch/amd64]]
[debug] session.go:188 server supports resume
[debug] image.go:97 Start untar layer
[debug] archive.go:88 [tar autodetect] n: [31 139 8 0 0 9 110 136 0 255]
[info] encountered error during pull and clearing it before resume: unexpected EOF
2014/09/14 14:09:54 Error pulling image (59c2ef25615aee308a641a79d79547445b6a2831) from 55.67.195.101/server, flate: corrupt input before offset 40690609
Ok, the greenlet is timing out actually. That might be a gunicorn bug you are hitting.
By any chance, would you be able to run off master? (I know there are quite some changes...) Or better, force gunicorn to 19.1 instead of 18.0 (inside requirements/main.txt).
Thanks!
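For reference, that pin would presumably be a one-line change in requirements/main.txt, along the lines of:

gunicorn==19.1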
Hi,
I have tried your solution of forcing gunicorn to 19.1, and I still have the same problem but a different error. The first error was just like Henkis': the greenlet was timing out.
Now I have this:
Sep 23 10:15:59 docker[23048]: 2014-09-23 10:15:59 [18] [ERROR] Error handling request
Sep 23 10:15:59 docker[23048]: Traceback (most recent call last):
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:15:59 docker[23048]: resp.write(item)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:15:59 docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:15:59 docker[23048]: sock.sendall(data)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:15:59 docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 443, in send
Sep 23 10:15:59 docker[23048]: return sock.send(data, flags)
Sep 23 10:15:59 docker[23048]: error: [Errno 104] Connection reset by peer
Sep 23 10:16:10 docker[23048]: 2014-09-23 10:16:10 [19] [ERROR] Error handling request
Sep 23 10:16:10 docker[23048]: Traceback (most recent call last):
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:16:10 docker[23048]: resp.write(item)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:16:10 docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:16:10 docker[23048]: sock.sendall(data)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:16:10 docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 443, in send
Sep 23 10:16:10 docker[23048]: return sock.send(data, flags)
Sep 23 10:16:10 docker[23048]: error: [Errno 32] Broken pipe
or that :
Sep 23 10:35:08 docker[23048]: 2014-09-23 10:35:08 [16] [ERROR] Error handling request
Sep 23 10:35:08 docker[23048]: Traceback (most recent call last):
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:35:08 docker[23048]: resp.write(item)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:35:08 docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:35:08 docker[23048]: sock.sendall(data)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:35:08 docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
Sep 23 10:35:08 docker[23048]: return sock.send(data, flags)
Sep 23 10:35:08 docker[23048]: error: [Errno 104] Connection reset by peer
At least the greenlet is no longer wandering off into la-la land.
So:
This smells to me like a known regression where cancelled requests wouldn't be handled correctly.
Assuming you are indeed behind nginx, can you try killing keep-alive? (e.g. proxy_set_header Connection "";)
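To illustrate the suggestion, a hypothetical nginx fragment (server name, port, and the rest of the proxy block are assumptions, not the registry's documented config):

server {
    listen 443 ssl;
    server_name registry.example.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        # Clear the Connection header forwarded upstream, per the
        # suggestion above, so keep-alive is not used toward the registry.
        proxy_set_header Connection "";
    }
}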
Right now no nginx; we access the registry directly on the local network on port 5000. After these errors the docker pull command just fails, and that's it; after multiple retries it works...
@mcadam that's likely this: https://github.com/benoitc/gunicorn/issues/818
I've been experiencing exactly the same symptoms and it appears the problem, at least in my case, is in fact downloading from S3. I created a tiny boto script to download a 69M layer:
#!/usr/bin/python
# Minimal reproduction: download one ~69 MB layer object straight from S3.
from boto.s3.connection import S3Connection

conn = S3Connection('***', '***')  # credentials redacted
bucket = conn.get_bucket('link-docker')
key = bucket.get_key('test/images/d497ad3926c8997e1e0de74cdd5285489bb2c4acd6db15292e04bbab07047cd0/layer')
key.get_contents_to_filename("test_layer")
and found it took 4s to run the first time, then 15s, 23s, 9s and finally after 2m3s:
Traceback (most recent call last):
...
File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 1471, in _get_file_internal
for bytes in self:
File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 365, in next
data = self.resp.read(self.BufferSize)
File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 415, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/ssl.py", line 341, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 260, in read
return self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer
Presumably this kind of unreliability with S3 isn't normal so I'm investigating further...
Same issue here:
b1b7444f3af2: Error downloading dependent layers
2014/10/29 21:09:51 Error pulling image (latest) from myregistry.com:5000/img-myimage, flate: corrupt input before offset 86796209
Restarting does not solve the problem.
We're having similar problems with 0.9.0. Everything seems to be working fine when just one or two machines are pulling at the same time, but when we let our frontend array (30+ machines) pull at the same time then all hell breaks loose.
Our workaround has been to put haproxy in front, configured to allow only as many connections to the registry as there are worker threads running, spreading the image pulls out over a larger time period. We're also running the registry on three different machines, all using the same S3 backend, so that haproxy has more backends to spread the requests across.
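A sketch of the haproxy fragment described above (addresses and the per-server cap are assumptions; the cap should roughly match the number of gunicorn workers per registry):

frontend registry_in
    bind *:5000
    mode http
    default_backend registry_pool

backend registry_pool
    mode http
    balance leastconn
    server reg1 10.0.0.11:5000 maxconn 4
    server reg2 10.0.0.12:5000 maxconn 4
    server reg3 10.0.0.13:5000 maxconn 4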
@garo and others
Thanks! The storage_redirect setting seems to fix all our issues. To others: just add "storage_redirect: true" to the registry.yaml. You can verify that it works with tcpdump ("tcpdump -l -A -s 1500 src port 5000 | grep Location:"); you should get nice headers like "Location: https://yourbucket.s3.amazonaws.com/..."
To add to @garo's comment (and to save you some digging): you can add -e STORAGE_REDIRECT=true to your docker command if you are running this in a container.
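For the YAML route, the equivalent fragment in the registry config would presumably look like this (flavor name and surrounding keys are assumptions):

prod:
    storage: s3
    storage_redirect: true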
We've got this issue too on our registries. We're running them inside an AWS VPC, one registry container per host, with 6 instances currently. Those instances are fronted by an elastic load balancer whose timeout is set to 5 minutes. We're using S3 as a storage backend.
After finding this thread, we set the registry with the -e STORAGE_REDIRECT=true option to delegate image downloads directly to S3. That has helped tremendously with the myriad of errors we were getting such as EOF.
However, we're still getting EOF errors on some registry calls that aren't actual image layer retrievals. For example, I saw an EOF error on a /ancestry call. I don't see any errors in the Docker Registry logs when these types of issues happen so I'm sort of at a loss.
These errors seem to happen when we have heavier load on the registries, such as 5-10 images pulling at the same time. However, that doesn't seem like a heavy load, especially with storage redirect and 6 instances of the registry behind a load balancer.
@all: we recently removed the ParallelKey fetch from S3 in the registry; it was triggering timeouts for large objects and cluttering disk space with orphaned temporary files. It's hard to tell whether any of the side effects you experienced here are linked to it, but that removal can surely only help.
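For context on what was removed: the _fetch_part greenlets in the tracebacks above were each fetching one byte range of a key. A simplified sketch of the pattern (not the actual registry code; credentials and names are placeholders):

# Fetch byte ranges of one S3 key in parallel greenlets.
import gevent
from boto.s3.connection import S3Connection

def fetch_part(path, start, end):
    # Each greenlet uses its own connection and writes one byte range.
    conn = S3Connection('<access-key>', '<secret-key>')
    key = conn.get_bucket('<bucket-name>').get_key('<key-name>')
    with open(path, 'r+b') as f:
        f.seek(start)
        key.get_contents_to_file(
            f, headers={'Range': 'bytes=%d-%d' % (start, end)})

open('/tmp/layer', 'wb').close()  # pre-create the target file
size = 64 << 20                   # pretend the object is 64 MB
bounds = range(0, size + 1, size // 4)
jobs = [gevent.spawn(fetch_part, '/tmp/layer', s, e - 1)
        for s, e in zip(bounds[:-1], bounds[1:])]
# If any single range stalls, the whole download fails with the
# greenlet timeout seen in the tracebacks above.
gevent.joinall(jobs)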
Hmm, that might help. Is there any documentation on how to disable that feature on our registries? I'm using the container, so I've usually just set environment variables that map to what's in the YAML config file, but I don't see the option you mentioned in the YAML config schema.
It's code removal (#961). You can test by using the latest master.
Ok I'm trying the fix out now, I'll have to wait a few days to validate whether our percentage of deploy failures when pulling from the registries goes down.
If anyone else is interested in trying this fix quickly, I have a container built on https://registry.hub.docker.com/u/dsw88/registry/. The tag is 0.9.1-parallelkey-fix. I built the container from master, which appears safe at this point since there haven't been any major code changes since the 0.9.1 stable release on Jan. 8: https://github.com/docker/docker-registry/commits/master
Ok it looks like I'm still seeing the EOF errors after applying the parallel key fix. I don't know yet whether their occurrence has been reduced, since I'll need to wait a few days to have enough real-world data from deploys.
But regardless, it appears that the ParallelKey removal isn't the ultimate fix for this issue. Any other suggestions? Do you think it would be good to set up a Redis LRU cache? I'm wondering whether having that cache would decrease the number of errors, since the registry would make fewer round-trips to S3.
I would definitely recommend using the Redis LRU cache. I would also advise delegating actual bits delivery to either S3 or a CloudFront view of it (https://github.com/docker/docker-registry/blob/master/ADVANCED.md#s3-storage-with-cloudfront).
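The LRU cache is configured in the registry's config.yml; a sketch, assuming the cache/cache_lru sections from the sample config (host and port must point at your Redis):

cache:
    host: localhost
    port: 6379
cache_lru:
    host: localhost
    port: 6379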
Ok so I'm now using the LRU cache and doing S3 storage redirect. We continue to get EOF errors, but we'll see if this at least cuts it down.
I'm still concerned about the underlying issue here, however, as adding a cache is at best masking the problem. I'll try to do some debugging in the registry to find out what's causing this error, but I'm not super familiar with the codebase or technology stack so it'll be slow going.
One of my problems is I can't see any errors in the registry logs when this problem occurs. @dmp42 do you have any suggestions about how I could go about getting enough information about the error to start debugging? The Docker client isn't much help when pulling images because it just says "unexpected EOF".
It appears that the small files like /ancestry and others are still failing periodically for us:
$ docker pull <host_name>/<image_name>:238
Pulling repository <host_name>/<image_name>
2015/03/18 20:44:16 Error pulling image (<tag_name>) from <host_name>/<image_name>,
Get http://<host_name>/v1/images/8a39dc87bd3e270444da2b7316ffcc8f7c2e65f5d91e5a3c3d2bcf17b905a7f6/ancestry: EOF
When I look in the registry logs, it shows that it got that request and even returned a 200, so presumably it thought it returned the response correctly:
[18/Mar/2015:20:44:16 +0000] "GET /v1/images/8a39dc87bd3e270444da2b7316ffcc8f7c2e65f5d91e5a3c3d2bcf17b905a7f6/ancestry HTTP/1.1" 200 3196 "-" "docker/1.1.2 go/go1.2.1 git-commit/d84a070 kernel/3.14.0-0.bpo.2-amd64 os/linux arch/amd64"
So are the threads in the webapp dying or timing out while it's streaming the response or something?
@dsw88 I now think you are not talking about the same issue that was initially described here (the recommended workarounds were/are here to address registry-side EOF issues communicating with S3). Your problem is more likely on the client-side pipeline.
Do you use an http proxy, or a reverse-proxy in front of your registry (nginx, HAProxy)? Do you use boot2docker?
Also, you are running a quite old version of docker (1.1.2).
I would strongly suggest upgrading if you can...
@dmp42 Sorry about that, I'll open a new issue related to the EOF issues we're seeing even though we've already implemented the S3 storage redirect. I'll post there about the details of my setup.
I'm seeing errors similar to the ones mentioned in this comment:
2015-03-27 20:35:14 [685] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
I am running docker-registry (registry:latest, Image ID e33e81d7024c, created 2015-03-20T22:14:39.683558078Z) behind trinitronx/nginx-proxy (trinitronx/nginx-proxy:latest, Image ID 95cc04d9d18e, created 2014-12-18T23:08:25.909249961Z). My nginx-proxy container is simply based off of jwilder/nginx-proxy, except I added client_max_body_size 0; to the server { ... } block to avoid silent docker push failures on large image pushes.
However, the only reason I'm running the nginx-proxy is to handle SSL, and when I run docker-registry with storage_redirect: true it causes SSL validation errors, because the docker daemon can't validate the SSL cert when it gets 302-redirected to *.amazonaws.com.
The error I see in /var/log/docker is:
Error pulling image (version_676) from registry.example.com:1234/example/web_container, Server error: Status 0 while fetching image layer (235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2)
[2a81fb14] -job pull(registry.example.com:1234/example/web_container, version_123) = ERR (1)
This seems to come from one of these lines in docker's source code (registry/session.go):
180: return nil, fmt.Errorf("Server error: Status %d while fetching image layer (%s)",
191: return nil, fmt.Errorf("Server error: Status %d while fetching image layer (%s)",
The reason I think this is an SSL verification issue is that we tested adding --insecure-registry=registry.example.com:1234 to our docker daemon startup command line, and the error went away.
To see what was happening with the requests, I manually ran curl -v -s http://registry.example.com:1234/v1/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer and got a 302 redirect which looked something like this (actual request scrubbed for privacy):
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5000 (#0)
> GET /v1/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer HTTP/1.1
> User-Agent: curl/7.36.0
> Host: registry.example.com:1234
> Accept: */*
>
< HTTP/1.1 302 FOUND
* Server gunicorn/19.1.1 is not blacklisted
< Server: gunicorn/19.1.1
< Date: Thu, 02 Apr 2015 23:17:44 GMT
< Connection: keep-alive
< Content-Type: text/html; charset=utf-8
< Content-Length: 681
< Location: https://registry.example.com.s3.amazonaws.com/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer?Signature=Scrubbed1234VRyRanD0mStr1ng%3D&Expires=1428017864&AWSAccessKeyId=AKIAWSACCESSKEYID
When I followed the redirect by hand, running curl -v -s 'https://registry.example.com.s3.amazonaws.com/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer?Signature=Scrubbed1234VRyRanD0mStr1ng%3D&Expires=1428017864&AWSAccessKeyId=AKIAWSACCESSKEYID', I saw this:
* About to connect() to registry.example.com.s3.amazonaws.com port 443 (#0)
* Trying 54.123.34.56... connected
* Connected to registry.example.com.s3.amazonaws.com (54.123.34.56) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* SSL: certificate subject name '*.s3.amazonaws.com' does not match target host name 'registry.example.com.s3.amazonaws.com'
* NSS error -12276
* Closing connection #0
So it's a catch-22. I need HTTPS security so I can run docker-registry with the STORAGE_REDIRECT=true argument (since it seems to pass Location: headers with AWS temporary S3 keys), but when I run nginx-proxy with HTTPS (and the docker daemon without --insecure-registry=registry.example.com:1234), it silently fails due to SSL validation errors. I've verified this by running curl commands against the registry when storage_redirect is on.
I've tracked my SSL issue down to this problem with S3 bucket names. This seems to have solved the issue with using -e STORAGE_REDIRECT=true, and allows HTTPS access direct to S3.
To fix my issue I had to:
1. Create a new S3 bucket whose name contains no dots . or underscores _
2. Copy everything over: ./bin/aws s3 sync s3://registry.example.com s3://registry-bucket-name-without-dots-and-underscores
3. Restart the docker-registry container with -e AWS_BUCKET=registry-bucket-name-without-dots-and-underscores
Now I no longer see the error: [Errno 32] Broken pipe
or this error:
Error pulling image (version_676) from registry.example.com:1234/example/web_container, Server error: Status 0 wh
ile fetching image layer (235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2)
[2a81fb14] -job pull(registry.example.com:1234/example/web_container, version_123) = ERR (1)
Correction... I still get the Broken pipe error, but only for requests that the docker-registry does not normally answer with a 302 response, i.e. things like /images/<id>/ancestry, /images/<id>/json, etc.
docker-registry log:
172.17.0.12 - - [09/Apr/2015:23:32:34 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
172.17.0.12 - - [09/Apr/2015:23:32:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
172.17.0.12 - - [09/Apr/2015:23:32:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
[2015-04-09 23:36:35 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
[2015-04-09 23:36:38 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
172.17.0.12 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/repositories/example/web_container/tags HTTP/1.1" 200 27959 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
172.17.0.12 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 200 2176 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
The failing docker pull command was:
sudo docker pull registry.example.com:443/example/web_container:version_123
Error it gave was:
2015/04/09 17:32:18 Error pulling image (version_123) from registry.example.com:443/example/web_container, Get https://registry.example.com:443/v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry: read tcp 123.231.123.45:443: i/o timeout
The docker daemon log /var/log/docker shows some info, but unfortunately does not log timestamps for all lines:
[info] POST /v1.15/images/create?fromImage=registry.example.com%3A443%2Fexample%2Fweb_container%3Aversion_123
[55949c70] +job pull(registry.example.com:443/example/web_container, version_123)
Error pulling image (version_123) from registry.example.com:443/example/web_container, Get https://registry.example.com:443/v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry: read tcp 123.231.123.45:443: i/o timeout
[55949c70] -job pull(registry.example.com:443/example/web_container, version_123) = ERR (1)
[info] POST /v1.15/images/create?fromImage=registry.example.com%3A443%2Fexample%2Fweb_container%3Aversion_123
[55949c70] +job pull(registry.example.com:443/example/web_container, version_123)
[55949c70] -job execStart(2089a2b3bc9d72b6e6913ab075fa27fe1d79669fbb273eb423f03c84f44288d1) = OK (0)
2015/04/09 17:32:38 http: response.WriteHeader on hijacked connection
[info] GET /v1.15/containers/example_web_1/json
[55949c70] +job container_inspect(example_web_1)
[55949c70] -job container_inspect(example_web_1) = OK (0)
[error] pull.go:175 Get https://registry.example.com:443/v1/repositories/example/web_container/tags: read tcp 12.34.56.78:443: i/o timeout
Get https://registry.example.com:443/v1/repositories/example/web_container/tags: read tcp 12.34.56.78:443: i/o timeout
[55949c70] -job pull(registry.example.com:443/example/web_container, version_123) = ERR (1)
The single line with a timestamp is: 2015/04/09 17:32:38 http: response.WriteHeader on hijacked connection.
The nginx-proxy logs show the same requests coming in, with response code 200:
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/repositories/example/web_container/tags HTTP/1.1" 200 27959 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/repositories/example/web_container/tags HTTP/1.1" 200 27959 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 200 2176 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 200 2176 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
I believe this is still an issue somewhere in the docker-registry gunicorn app.
The setup is (all IPs, layer ids, and containers anonymized to protect the innocent):
docker daemon (12.34.56.78) --> nginx-proxy (Host IP: 123.231.123.45 container IP: 172.17.0.12) --> docker-registry (container running on same host as nginx-proxy container)
Just noticed in my previous logs that the docker daemon (the client accessing docker-registry to pull images) has the timestamp of the error as 2015/04/09 17:32:18.
I searched the logs for that time (17:32:18) and couldn't find it, and then noticed that the clocks on the docker-daemon host and the docker-registry / nginx-proxy host are set to different time zones.
I checked the date on the host running the docker daemon and the one running the nginx-proxy and docker-registry containers, to determine the time offset and prove to myself that the clocks were synchronized correctly via NTP. This will help in lining up the timing of events. Here are the results:
"docker daemon" host: Thu Apr 9 18:57:44 MDT 2015 = 2015-04-09 18:57:44 -0600 = 2015-04-10 00:57:44 UTC
docker-registry host: Fri Apr 10 00:57:44 UTC 2015 = 2015-04-10 00:57:44 UTC
So both hosts are synchronized to the correct time, but are in different time zones: one is UTC, one is UTC-6. So we just add 6 hours to the docker daemon host time:
'2015/04/09 17:32:18' => 2015-04-09 17:32:18 -0600 => 2015-04-09 23:32:18 UTC
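That conversion is easy to sanity-check with GNU date:

$ date -u -d '2015-04-09 17:32:18 -0600'
Thu Apr  9 23:32:18 UTC 2015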
Searching the nginx-proxy logs for 23:32:18 gives 2 matches:
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:32:18 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 499 0 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:32:18 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 499 0 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
Searching the docker-registry logs for 23:32:18 gives NO matches! Something fishy is going on...
So nginx sees the request from the docker daemon for GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry, and then logs HTTP code 499 for it.
HTTP 499 in nginx means the client closed its connection before it could hear a response. Making the assumption that nginx is telling us the truth, here is what I think is happening:
1. sudo docker pull registry.example.com:443/example/web_container:version_123 is run.
2. 2015-04-09 23:32:18 UTC: the docker daemon performs the request against nginx-proxy (GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry) and eventually reports: Error pulling image (version_123) from registry.example.com:443/example/web_container, Get https://registry.example.com:443/v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry: read tcp 123.231.123.45:443: i/o timeout
3. 2015-04-09 23:32:18 UTC: nginx-proxy logs the request & HTTP status 499.
4. 2015-04-09 23:32:38 UTC: the docker daemon writes the log line 2015/04/09 17:32:38 http: response.WriteHeader on hijacked connection (time for that line: '2015/04/09 17:32:38' => 2015-04-09 17:32:38 -0600 => 2015-04-09 23:32:38 UTC).
5. 2015-04-09 23:36:35 UTC & 2015-04-09 23:36:38 UTC: docker-registry logs the Broken pipe errors:
errors:[2015-04-09 23:36:35 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
[2015-04-09 23:36:38 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
So the docker-registry throws the final Broken pipe error at 2015-04-09 23:36:38 UTC.
From the time the docker daemon makes the request (2015-04-09 23:32:18 UTC) to the time the error occurs (2015-04-09 23:36:38 UTC), the elapsed time is 0:04:20 (h:m:s), i.e. 260.00 seconds, or 4.33 minutes.
Given the symptoms and the proposed chain of events, this could potentially be related to benoitc/gunicorn#414.
The docker-registry container is running:
root@docker-registry:/# gunicorn -v
gunicorn (version 19.1.1)
And it has these python packages installed:
root@docker-registry:/# pip list
argparse (1.2.1)
backports.lzma (0.0.3)
blinker (1.3)
boto (2.34.0)
bugsnag (2.0.2)
chardet (2.0.1)
colorama (0.2.5)
docker-registry (0.9.1)
docker-registry-core (2.0.3)
Flask (0.10.1)
Flask-Cors (1.10.3)
gevent (1.0.1)
greenlet (0.4.5)
gunicorn (19.1.1)
html5lib (0.999)
itsdangerous (0.24)
Jinja2 (2.7.3)
M2Crypto (0.22.3)
MarkupSafe (0.23)
newrelic (2.22.1.20)
pip (1.5.4)
PyYAML (3.11)
redis (2.10.3)
requests (2.3.0)
setuptools (5.8)
simplejson (3.6.2)
six (1.5.2)
SQLAlchemy (0.9.4)
urllib3 (1.7.1)
WebOb (1.4)
Werkzeug (0.10.1)
wsgiref (0.1.2)
I'm able to reproduce this error easily by running the docker-registry-ui container:
boot2docker init
boot2docker up
eval $(boot2docker shellinit bash)
# start
docker run -d \
-p 80:80 \
-p 443:443 \
-v /var/run/docker.sock:/tmp/docker.sock \
-v /etc/docker/certs.d/:/etc/nginx/certs \
--name=nginx-proxy \
-t trinitronx/nginx-proxy
# start docker-registry-ui
docker run -d \
-p 8080:8080 \
-e REG1=https://registry.example.com:443/v1/ \
--name=docker-registry-ui \
atcol/docker-registry-ui
Watch the logs by opening 3 terminals:
# Terminal 1
docker logs -f nginx-proxy
# Terminal 2
docker logs -f docker-registry
# Terminal 3
docker logs -f docker-registry-ui
I then simply access the "Images" page through the docker-registry-ui Web UI, via the URL http://ip-of-boot2docker-box:8080/repository/index.
Here are all 3 anonymized logs.
The other way to reproduce it, although perhaps not as easy, is to do lots of docker pull commands in succession for different images / tagged containers.
Got another instance of this bug, with a different request, to the /repositories/*/*/images endpoint. However: same 504 Gateway Timeout, similar stacktrace, and the same Broken pipe error from gevent.
Here are the logs. There are a couple of requests in them: the first few are from the initial docker pull that failed; the next are from 2 manual cURL requests I performed, one of which reproduced the error while the other didn't. This bug is odd in that it only occurs sometimes, not always; usually a second retry is successful. Like I mentioned before, the easiest way to reproduce it is to run the registry UI and just hit some pages to search around for images.
This bug looks a lot like gevent/gevent#445, and there are other similar bugs on gevent's GitHub issue tracker. These are the ones that seem most closely related:
After enabling the SEARCH_BACKEND and GUNICORN_OPTS='[--preload]' options on the docker registry, we ran into stability issues (docker/docker-registry#540).
After disabling both of these options, the stability issues went away and the errors became much less frequent (still present, but only occasional).
I didn't have enough time to investigate further, but turning off search and disabling the --preload option did alleviate the issue.
Since this project is deprecated, hopefully this helps anyone still stuck on Registry V1.
We are using a privately hosted registry on Amazon which seems to degrade after we have pulled larger images (1GB+) from it a few times; eventually it stops responding. A restart of the registry container seems to fix the problem for a while:
And again:
We often see these messages in the log but not always:
We currently use the latest registry (registry 0.8.0 cd3581c06bdc) but have had the same problems earlier.
Registry config:
Dockerfile:
config.yml: