Henkis opened this issue 10 years ago
A little more from the log:
Wed, 27 Aug 2014 06:03:14 GMT
/docker-registry/prod/images/6e66087f3ffe002664507d225d07b6929843c3f0299f5335a70c1727c8833737/layer
2014-08-27 06:03:14,189 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:7gRFO3tyNcihZCHvmLPGn9WWVFI=
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
boto_key.get_contents_to_file(f, headers={'Range': brange})
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
response_headers=response_headers)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
query_args=None)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
for bytes in self:
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
data = self.resp.read(self.BufferSize)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
self._wait(self._read_event)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
self.hub.wait(watcher)
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
result = waiter.get()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
return self.hub.switch()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f8bbb7269b0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f8bbb2be550>>('/tmp/tmpPPltfO', 1, 13495293, 26990585)> failed with timeout
After a few more pulls it stops responding and nothing is written to the container log.
Client:
2014/08/27 15:53:22 Error pulling image (97a9cc4076fd69d8a35be028a74f0e00b8066555) from 256.31.14.183/server, Failed to download json: Get http://256.31.14.183/v1/images/31f7549bbb3e5ffba93ed7a37ae6cc86fafe92cfd343eca00a788a39c1e57023/json: read tcp 256.31.14.183:80: i/o timeout
root@ip-172-31-7-6:~# ./test.sh
2014/08/27 16:07:14 Error: Invalid Registry endpoint: Get http://256.31.14.183/v1/_ping: read tcp 256.31.14.183:80: i/o timeout
This tells me you are having issues reaching your S3 bucket. Can you look into details on that front? AWS region?
@shin- what do you think?
If it's S3 related I will deploy a boto.cfg with the region set and debug enabled. Initially a restart of the registry container helped, so I did not blame S3 connectivity, but I'm not entirely sure any longer.
It's currently not possible to set S3 region in the registry config, correct?
Right - it's not possible if you are running it inside a docker container (possibly a gevent version issue). Keep me posted.
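For reference, boto itself can be pointed at a regional endpoint, with verbose logging enabled, through its own config file. A minimal boto.cfg sketch, assuming boto's standard [Boto] and [s3] sections (verify the keys against your boto version):

[Boto]
debug = 2
num_retries = 3

[s3]
host = s3-eu-west-1.amazonaws.com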
I've hard-wired boto against eu-west-1 but I am still seeing the same issues. I noticed in the log today that worker threads reported problems inside the container:
2014-09-01 20:27:14 [108] [WARNING] Worker graceful timeout (pid:108)
2014-09-01 20:27:14,172 WARNING: Worker graceful timeout (pid:108)
2014-09-01 20:27:15 [108] [INFO] Worker exiting (pid: 108)
If it's S3 related, any ideas on how to proceed with debugging? I've seen quite a few 404s in the logs, but they are always for an _inprogress file; is that expected?
Another error with current debug:
Client:
b848adeb3e09: Error downloading dependent layers
2014/09/01 22:58:37 Error pulling image (fb469e5e3584041a51bdd35175776033a829e6b6) from 172.31.9.215/adserver, flate: corrupt input before offset 13576377
Server:
Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,804 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,822 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: Method: HEAD
2014-09-01 20:58:25,822 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,822 DEBUG: Data:
2014-09-01 20:58:25,822 DEBUG: Headers: {}
2014-09-01 20:58:25,822 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,823 DEBUG: Port: 80
2014-09-01 20:58:25,823 DEBUG: Params: {}
2014-09-01 20:58:25,823 DEBUG: Token: None
2014-09-01 20:58:25,823 DEBUG: StringToSign:
HEAD
Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,823 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,834 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: Method: HEAD
2014-09-01 20:58:25,834 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,834 DEBUG: Data:
2014-09-01 20:58:25,834 DEBUG: Headers: {'Range': 'bytes=13561477-67820546'}
2014-09-01 20:58:25,834 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,834 DEBUG: Port: 80
2014-09-01 20:58:25,834 DEBUG: Params: {}
2014-09-01 20:58:25,835 DEBUG: Token: None
2014-09-01 20:58:25,835 DEBUG: StringToSign:
HEAD
Mon, 01 Sep 2014 20:58:25 GMT
/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,835 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:oyWlGXzFPB36R4WVfRw7m+esL1g=
2014-09-01 20:58:25,854 DEBUG: path=/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,854 DEBUG: auth_path=/arkivdigital-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,854 DEBUG: Method: GET
2014-09-01 20:58:25,854 DEBUG: Path: /prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,855 DEBUG: Data:
2014-09-01 20:58:25,855 DEBUG: Headers: {}
2014-09-01 20:58:25,855 DEBUG: Host: arkivdigital-docker-registry.s3.amazonaws.com
2014-09-01 20:58:25,855 DEBUG: Port: 80
2014-09-01 20:58:25,855 DEBUG: Params: {}
2014-09-01 20:58:25,856 DEBUG: Token: None
2014-09-01 20:58:25,856 DEBUG: StringToSign:
GET
Mon, 01 Sep 2014 20:58:25 GMT
/secret-docker-registry/prod/images/b747adeb3e09f87206e8e6a9da5c56a126dd0e6475278e79fa3ec38f1366fe6c/layer
2014-09-01 20:58:25,856 DEBUG: Signature:
AWS AKIAIXQJ3M45M46VAQ2A:QaHXpfcR+J4yOoAEKiLTHNqW28Y=
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
boto_key.get_contents_to_file(f, headers={'Range': brange})
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
response_headers=response_headers)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
query_args=None)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
for bytes in self:
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
data = self.resp.read(self.BufferSize)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
self._wait(self._read_event)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
self.hub.wait(watcher)
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
result = waiter.get()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
return self.hub.switch()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f66e6898910: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f66e642b990>>('/tmp/tmp36R_mv', 1, 13564110, 27128219)> failed with timeout
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/docker_registry/core/boto.py", line 76, in _fetch_part
boto_key.get_contents_to_file(f, headers={'Range': brange})
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1603, in get_contents_to_file
response_headers=response_headers)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1435, in get_file
query_args=None)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1488, in _get_file_internal
for bytes in self:
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 368, in next
data = self.resp.read(self.BufferSize)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 416, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 392, in recv
self._wait(self._read_event)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 298, in _wait
self.hub.wait(watcher)
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 341, in wait
result = waiter.get()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 568, in get
return self.hub.switch()
File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 331, in switch
return greenlet.switch(self)
timeout: timed out
<Greenlet at 0x7f66e68987d0: <bound method ParallelKey._fetch_part of <docker_registry.core.boto.ParallelKey object at 0x7f66e642b990>>('/tmp/tmp36R_mv', 4, 54256440, 67820545)> failed with timeout
@Henkis the 404s are irrelevant.
What matters here is the timeouts reaching your bucket.
Unfortunately, I have little help to offer as far as debugging is concerned. I would start with a Python script from scratch using boto S3 (https://github.com/boto/boto), query the bucket repeatedly until the issue triggers, then dig into boto.
Keep me posted on any progress on this.
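If you go that route, a minimal probe sketch along those lines (credentials, bucket, and key are placeholders):

#!/usr/bin/python
# Hypothetical probe: fetch the same layer key repeatedly and time each
# attempt, so a stalled or reset connection shows up as an outlier.
import time
from boto.s3.connection import S3Connection

conn = S3Connection('<access-key>', '<secret-key>')
key = conn.get_bucket('<bucket-name>').get_key('<path/to/layer>')

for attempt in range(20):
    start = time.time()
    try:
        key.get_contents_to_filename('/tmp/layer_probe')
        print('attempt %2d: ok after %5.1fs' % (attempt, time.time() - start))
    except Exception as exc:
        print('attempt %2d: FAILED after %5.1fs: %r'
              % (attempt, time.time() - start, exc))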
@Henkis can you try bumping your boto timeout values:
[Boto]
http_socket_timeout = 60
inside boto.cfg
... and report here if that helps?
Thanks.
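If the registry runs in a container, the file also has to end up where boto looks for it (/etc/boto.cfg or ~/.boto by default). A hypothetical Dockerfile fragment to bake it in, assuming you build on the stock registry image:

FROM registry:0.8.0
ADD boto.cfg /etc/boto.cfg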
I have tried adding the socket timeout to my boto.cfg; it doesn't help.
Same problems; maybe some more debug output:
[debug] http.go:162 http://55.67.195.101/v1/images/b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a/json -- HEADERS: map[User-Agent:[docker/1.2.0 go/go1.3.1 git-c94034bf8411f: Download complete
[debug] image.go:325 Json string: {{"id":"b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a","parent":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","created":"2014-08-26T19:40:30.559806204Z","container":"9328ff054fc27a5163c1ac44bfcb026380b9b15: Pulling fs layer
ainname":"","User":"","Memory":0,"MemorySwap":0,"CpuShares":0,"AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"PortSpecs":null,"ExposedPorts":{"443/tcp":{},"80/tcp"b026380b9b15: Downloading 40.68 MB/67.82 MB 1s
b026380b9b15: Error pulling image (59c2ef25615aee308a641a79d79547445b6a2831) from 55.67.195.101/server, flate: corrupt input before offset 40690609 orrupt input before offset 40690609 7] -job pull(55.67.195.101/server, 59c2ef25615aee308a641a79d79547445b6a2831) = ERR (1)le install --gemfile /home/app/webapp/Gemfile --path /home/app/bundle --deploymentb026380b9b15: Error downloading dependent layers
assets:precompile; cp config/database.production.yml config/database.yml"],"Dns":null,"Image":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","Volumes":null,"VolumesFrom":"","WorkingDir":"","Entrypoint":null,"NetworkDisabled":false,"OnBuild":[]},"docker_version":"0.9.1","config":{"Hostname":"95699cec7d3e","Domainname":"","User":"","Memory":0,"MemorySwap":0,"CpuShares":0,"AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"PortSpecs":null,"ExposedPorts":{"443/tcp":{},"80/tcp":{}},"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["HOME=/root","PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["/sbin/my_init"],"Dns":null,"Image":"94034bf8411f38a7a280d612bff163da58ea5b622e7695b6c62edffb001ee1ab","Volumes":null,"VolumesFrom":"","WorkingDir":"","Entrypoint":null,"NetworkDisabled":false,"OnBuild":[]},"architecture":"amd64","os":"linux","Size":69388467}}
[debug] http.go:162 http://55.67.195.101/v1/images/b026380b9b1536f67fc9db483808d236e2aaa02fe7dbf9db3377dfc830941f6a/layer -- HEADERS: map[User-Agent:[docker/1.2.0 go/go1.3.1 git-commit/fa7b24f kernel/3.13.0-24-generic os/linux arch/amd64]]
[debug] session.go:188 server supports resume
[debug] image.go:97 Start untar layer
[debug] archive.go:88 [tar autodetect] n: [31 139 8 0 0 9 110 136 0 255]
[info] encountered error during pull and clearing it before resume: unexpected EOF
2014/09/14 14:09:54 Error pulling image (59c2ef25615aee308a641a79d79547445b6a2831) from 55.67.195.101/server, flate: corrupt input before offset 40690609
Ok, the greenlet is timing out actually. That might be a gunicorn bug you are hitting.
By any chance, would you be able to run off master? (I know there are quite some changes...) Or better, force gunicorn to 19.1 instead of 18.0 (inside requirements/main.txt).
Thanks!
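For reference, that pin would presumably be a one-line change in requirements/main.txt, along the lines of:

gunicorn==19.1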
Hi,
I have tried your solution of forcing gunicorn to 19.1, and I still have the same problem but a different error. The first error was just like Henkis': the greenlet was timing out.
Now I have this:
Sep 23 10:15:59 docker[23048]: 2014-09-23 10:15:59 [18] [ERROR] Error handling request
Sep 23 10:15:59 docker[23048]: Traceback (most recent call last):
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:15:59 docker[23048]: resp.write(item)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:15:59 docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:15:59 docker[23048]: sock.sendall(data)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:15:59 docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:15:59 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 443, in send
Sep 23 10:15:59 docker[23048]: return sock.send(data, flags)
Sep 23 10:15:59 docker[23048]: error: [Errno 104] Connection reset by peer
Sep 23 10:16:10 docker[23048]: 2014-09-23 10:16:10 [19] [ERROR] Error handling request
Sep 23 10:16:10 docker[23048]: Traceback (most recent call last):
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:16:10 docker[23048]: resp.write(item)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:16:10 docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:16:10 docker[23048]: sock.sendall(data)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:16:10 docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:16:10 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 443, in send
Sep 23 10:16:10 docker[23048]: return sock.send(data, flags)
Sep 23 10:16:10 docker[23048]: error: [Errno 32] Broken pipe
or that :
Sep 23 10:35:08 docker[23048]: 2014-09-23 10:35:08 [16] [ERROR] Error handling request
Sep 23 10:35:08 docker[23048]: Traceback (most recent call last):
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
Sep 23 10:35:08 docker[23048]: resp.write(item)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
Sep 23 10:35:08 docker[23048]: util.write(self.sock, arg, self.chunked)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
Sep 23 10:35:08 docker[23048]: sock.sendall(data)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
Sep 23 10:35:08 docker[23048]: data_sent += self.send(_get_memory(data, data_sent), flags)
Sep 23 10:35:08 docker[23048]: File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
Sep 23 10:35:08 docker[23048]: return sock.send(data, flags)
Sep 23 10:35:08 docker[23048]: error: [Errno 104] Connection reset by peer
At least the greenlet is no longer wandering off into la-la land.
So:
This smells to me like a known regression where cancelled requests wouldn't be handled correctly.
Assuming you are indeed behind nginx, can you try killing keep-alive? (e.g. proxy_set_header Connection "";)
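To illustrate the suggestion, a hypothetical nginx fragment (server name, port, and the rest of the proxy block are assumptions, not the registry's documented config):

server {
    listen 443 ssl;
    server_name registry.example.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        # Clear the Connection header forwarded upstream, per the
        # suggestion above, so keep-alive is not used toward the registry.
        proxy_set_header Connection "";
    }
}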
Right now no nginx; we access the registry directly on the local network on port 5000. After these errors the docker pull command just fails, and that's it; after multiple retries it works...
@mcadam that's likely this: https://github.com/benoitc/gunicorn/issues/818
I've been experiencing exactly the same symptoms and it appears the problem, at least in my case, is in fact downloading from S3. I created a tiny boto script to download a 69M layer:
#!/usr/bin/python
# Minimal reproduction: download one ~69 MB layer object straight from S3.
from boto.s3.connection import S3Connection

conn = S3Connection('***', '***')  # credentials redacted
bucket = conn.get_bucket('link-docker')
key = bucket.get_key('test/images/d497ad3926c8997e1e0de74cdd5285489bb2c4acd6db15292e04bbab07047cd0/layer')
key.get_contents_to_filename("test_layer")
and found it took 4s to run the first time, then 15s, 23s, 9s and finally after 2m3s:
Traceback (most recent call last):
...
File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 1471, in _get_file_internal
for bytes in self:
File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 365, in next
data = self.resp.read(self.BufferSize)
File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 415, in read
return httplib.HTTPResponse.read(self, amt)
File "/usr/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/ssl.py", line 341, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 260, in read
return self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer
Presumably this kind of unreliability with S3 isn't normal so I'm investigating further...
Same issue here:
b1b7444f3af2: Error downloading dependent layers
2014/10/29 21:09:51 Error pulling image (latest) from myregistry.com:5000/img-myimage, flate: corrupt input before offset 86796209
Restarting does not solve the problem.
We're having similar problems with 0.9.0. Everything seems to be working fine when just one or two machines are pulling at the same time, but when we let our frontend array (30+ machines) pull at the same time then all hell breaks loose.
Our workaround has been to put haproxy in front, configured to allow only as many connections to the registry as there are worker threads running, spreading the image pulls out over a larger time period. We're also running the registry on three different machines, all using the same S3 backend, so that haproxy has more backends to spread the requests across.
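A sketch of the haproxy fragment described above (addresses and the per-server cap are assumptions; the cap should roughly match the number of gunicorn workers per registry):

frontend registry_in
    bind *:5000
    mode http
    default_backend registry_pool

backend registry_pool
    mode http
    balance leastconn
    server reg1 10.0.0.11:5000 maxconn 4
    server reg2 10.0.0.12:5000 maxconn 4
    server reg3 10.0.0.13:5000 maxconn 4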
@garo and others
Thanks! The storage_redirect setting seems to fix all our issues. To others: just add "storage_redirect: true" to the registry.yaml. You can verify that it works with tcpdump ("tcpdump -l -A -s 1500 src port 5000 | grep Location:"); you should get nice headers like "Location: https://yourbucket.s3.amazonaws.com/..."
To add to @garo's comment (and to save you some digging): you can add -e STORAGE_REDIRECT=true to your docker command if you are running this in a container.
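For the YAML route, the equivalent fragment in the registry config would presumably look like this (flavor name and surrounding keys are assumptions):

prod:
    storage: s3
    storage_redirect: true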
We've got this issue too on our registries. We're running them inside an AWS VPC, one registry container per host, with 6 instances currently. Those instances are fronted by an elastic load balancer whose timeout is set to 5 minutes. We're using S3 as a storage backend.
After finding this thread, we set the registry with the -e STORAGE_REDIRECT=true option to delegate image downloads directly to S3. That has helped tremendously with the myriad of errors we were getting such as EOF.
However, we're still getting EOF errors on some registry calls that aren't actual image layer retrievals. For example, I saw an EOF error on a /ancestry call. I don't see any errors in the Docker Registry logs when these types of issues happen so I'm sort of at a loss.
These errors seem to happen when we have heavier load on the registries, such as 5-10 images pulling at the same time. However, that doesn't seem like a heavy load, especially with storage redirect and 6 instances of the registry behind a load balancer.
@all: we recently removed the ParallelKey fetch from S3 in the registry; it was triggering timeouts for large objects and cluttering disk space with orphaned temporary files. It's hard to tell whether any of the side effects you experienced here are linked to it, but that removal can surely only help.
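For context on what was removed: the _fetch_part greenlets in the tracebacks above were each fetching one byte range of a key. A simplified sketch of the pattern (not the actual registry code; credentials and names are placeholders):

# Fetch byte ranges of one S3 key in parallel greenlets.
import gevent
from boto.s3.connection import S3Connection

def fetch_part(path, start, end):
    # Each greenlet uses its own connection and writes one byte range.
    conn = S3Connection('<access-key>', '<secret-key>')
    key = conn.get_bucket('<bucket-name>').get_key('<key-name>')
    with open(path, 'r+b') as f:
        f.seek(start)
        key.get_contents_to_file(
            f, headers={'Range': 'bytes=%d-%d' % (start, end)})

open('/tmp/layer', 'wb').close()  # pre-create the target file
size = 64 << 20                   # pretend the object is 64 MB
bounds = range(0, size + 1, size // 4)
jobs = [gevent.spawn(fetch_part, '/tmp/layer', s, e - 1)
        for s, e in zip(bounds[:-1], bounds[1:])]
# If any single range stalls, the whole download fails with the
# greenlet timeout seen in the tracebacks above.
gevent.joinall(jobs)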
Hmm, that might help. Is there any documentation on how to disable that feature on our registries? I'm using the container, so I've usually just set environment variables that map to what's in the YAML config file, but I don't see the option you mentioned in the YAML config schema.
It's code removal (#961). You can test by using the latest master.
Ok I'm trying the fix out now, I'll have to wait a few days to validate whether our percentage of deploy failures when pulling from the registries goes down.
If anyone else is interested in trying this fix quickly, I have a container built on https://registry.hub.docker.com/u/dsw88/registry/. The tag is 0.9.1-parallelkey-fix. I built the container from master, which appears safe at this point since there haven't been any major code changes since the 0.9.1 stable release on Jan. 8: https://github.com/docker/docker-registry/commits/master
Ok it looks like I'm still seeing the EOF errors after applying the parallel key fix. I don't know yet whether their occurrence has been reduced, since I'll need to wait a few days to have enough real-world data from deploys.
But regardless, it appears that the ParallelKey removal isn't the ultimate fix for this issue. Any other suggestions? Do you think it would be good to set up a Redis LRU cache? I'm wondering whether having that cache would decrease the number of errors, since the registry would make fewer round-trips to S3.
I would definitely recommend using the Redis LRU cache. I would also advise delegating actual bits delivery to either S3 or a CloudFront view of it (https://github.com/docker/docker-registry/blob/master/ADVANCED.md#s3-storage-with-cloudfront).
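The LRU cache is configured in the registry's config.yml; a sketch, assuming the cache/cache_lru sections from the sample config (host and port must point at your Redis):

cache:
    host: localhost
    port: 6379
cache_lru:
    host: localhost
    port: 6379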
Ok so I'm now using the LRU cache and doing S3 storage redirect. We continue to get EOF errors, but we'll see if this at least cuts it down.
I'm still concerned about the underlying issue here, however, as adding a cache is at best masking the problem. I'll try to do some debugging in the registry to find out what's causing this error, but I'm not super familiar with the codebase or technology stack so it'll be slow going.
One of my problems is I can't see any errors in the registry logs when this problem occurs. @dmp42 do you have any suggestions about how I could go about getting enough information about the error to start debugging? The Docker client isn't much help when pulling images because it just says "unexpected EOF".
It appears that the small files like /ancestry and others are still failing periodically for us:
$ docker pull <host_name>/<image_name>:238
Pulling repository <host_name>/<image_name>
2015/03/18 20:44:16 Error pulling image (<tag_name>) from <host_name>/<image_name>,
Get http://<host_name>/v1/images/8a39dc87bd3e270444da2b7316ffcc8f7c2e65f5d91e5a3c3d2bcf17b905a7f6/ancestry: EOF
When I look in the registry logs, it shows that it got that request and even returned a 200, so presumably it thought it returned the response correctly:
[18/Mar/2015:20:44:16 +0000] "GET /v1/images/8a39dc87bd3e270444da2b7316ffcc8f7c2e65f5d91e5a3c3d2bcf17b905a7f6/ancestry HTTP/1.1" 200 3196 "-" "docker/1.1.2 go/go1.2.1 git-commit/d84a070 kernel/3.14.0-0.bpo.2-amd64 os/linux arch/amd64"
So are the threads in the webapp dying or timing out while it's streaming the response or something?
@dsw88 I now think you are not talking about the same issue that was initially described here (the recommended workarounds were/are here to address registry-side EOF issues communicating with S3). Your problem is more likely on the client-side pipeline.
Do you use an http proxy, or a reverse-proxy in front of your registry (nginx, HAProxy)? Do you use boot2docker?
Also, you are running a quite old version of docker (1.1.2).
I would strongly suggest upgrading if you can...
@dmp42 Sorry about that, I'll open a new issue related to the EOF issues we're seeing even though we've already implemented the S3 storage redirect. I'll post there about the details of my setup.
I'm seeing errors similar to the ones mentioned in this comment:
2015-03-27 20:35:14 [685] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
I am running docker-registry (registry:latest, Image ID e33e81d7024c, created 2015-03-20T22:14:39.683558078Z) behind trinitronx/nginx-proxy (trinitronx/nginx-proxy:latest, Image ID 95cc04d9d18e, created 2014-12-18T23:08:25.909249961Z). My nginx-proxy container is simply based off of jwilder/nginx-proxy, except I added client_max_body_size 0; to the server { ... } block to avoid silent docker push failures on large image pushes.
However, the only reason I'm running the nginx-proxy is to handle SSL, and when I run docker-registry with storage_redirect: true it causes SSL validation errors, because the docker daemon can't validate the SSL cert when it gets 302-redirected to *.amazonaws.com.
The error I see in /var/log/docker is:
Error pulling image (version_676) from registry.example.com:1234/example/web_container, Server error: Status 0 while fetching image layer (235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2)
[2a81fb14] -job pull(registry.example.com:1234/example/web_container, version_123) = ERR (1)
This seems to come from one of these lines in docker's source code (registry/session.go):
180: return nil, fmt.Errorf("Server error: Status %d while fetching image layer (%s)",
191: return nil, fmt.Errorf("Server error: Status %d while fetching image layer (%s)",
The reason I think this is an SSL verification issue is that we tested adding --insecure-registry=registry.example.com:1234 to our docker daemon startup command line, and the error went away.
To see what was happening with the requests, I manually ran curl -v -s http://registry.example.com:1234/v1/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer and got a 302 redirect which looked something like this (actual request scrubbed for privacy):
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5000 (#0)
> GET /v1/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer HTTP/1.1
> User-Agent: curl/7.36.0
> Host: registry.example.com:1234
> Accept: */*
>
< HTTP/1.1 302 FOUND
* Server gunicorn/19.1.1 is not blacklisted
< Server: gunicorn/19.1.1
< Date: Thu, 02 Apr 2015 23:17:44 GMT
< Connection: keep-alive
< Content-Type: text/html; charset=utf-8
< Content-Length: 681
< Location: https://registry.example.com.s3.amazonaws.com/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer?Signature=Scrubbed1234VRyRanD0mStr1ng%3D&Expires=1428017864&AWSAccessKeyId=AKIAWSACCESSKEYID
When I followed the redirect by hand, running curl -v -s 'https://registry.example.com.s3.amazonaws.com/images/235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2/layer?Signature=Scrubbed1234VRyRanD0mStr1ng%3D&Expires=1428017864&AWSAccessKeyId=AKIAWSACCESSKEYID', I saw this:
* About to connect() to registry.example.com.s3.amazonaws.com port 443 (#0)
* Trying 54.123.34.56... connected
* Connected to registry.example.com.s3.amazonaws.com (54.123.34.56) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* SSL: certificate subject name '*.s3.amazonaws.com' does not match target host name 'registry.example.com.s3.amazonaws.com'
* NSS error -12276
* Closing connection #0
So it's a catch-22. I need HTTPS security so I can run docker-registry with the STORAGE_REDIRECT=true argument (since it seems to pass Location: headers with AWS temporary S3 keys), but when I run nginx-proxy with HTTPS (and the docker daemon without --insecure-registry=registry.example.com:1234), it silently fails due to SSL validation errors. I've verified this by running curl commands against the registry when storage_redirect is on.
I've tracked my SSL issue down to this problem with S3 bucket names. This seems to have solved the issue with using -e STORAGE_REDIRECT=true, and allows HTTPS access direct to S3.
To fix my issue I had to:
1. Create a new S3 bucket whose name contains no dots . or underscores _
2. Copy everything over: ./bin/aws s3 sync s3://registry.example.com s3://registry-bucket-name-without-dots-and-underscores
3. Restart the docker-registry container with -e AWS_BUCKET=registry-bucket-name-without-dots-and-underscores
Now I no longer see the error: [Errno 32] Broken pipe
or this error:
Error pulling image (version_676) from registry.example.com:1234/example/web_container, Server error: Status 0 wh
ile fetching image layer (235056cffc249f5f3bfa5ba4425231ab2cf813876a43e1bc05737910c41c53f2)
[2a81fb14] -job pull(registry.example.com:1234/example/web_container, version_123) = ERR (1)
Correction... I still get the Broken pipe error, but only for requests that the docker-registry does not normally answer with a 302 response, i.e. things like /images/<id>/ancestry, /images/<id>/json, etc.
docker-registry log:
172.17.0.12 - - [09/Apr/2015:23:32:34 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
172.17.0.12 - - [09/Apr/2015:23:32:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
172.17.0.12 - - [09/Apr/2015:23:32:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
[2015-04-09 23:36:35 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
[2015-04-09 23:36:38 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
172.17.0.12 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/repositories/example/web_container/tags HTTP/1.1" 200 27959 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
172.17.0.12 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 200 2176 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
The failing docker pull command was:
sudo docker pull registry.example.com:443/example/web_container:version_123
Error it gave was:
2015/04/09 17:32:18 Error pulling image (version_123) from registry.example.com:443/example/web_container, Get https://registry.example.com:443/v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry: read tcp 123.231.123.45:443: i/o timeout
The docker daemon log /var/log/docker shows some info, but unfortunately does not log timestamps for all lines:
[info] POST /v1.15/images/create?fromImage=registry.example.com%3A443%2Fexample%2Fweb_container%3Aversion_123
[55949c70] +job pull(registry.example.com:443/example/web_container, version_123)
Error pulling image (version_123) from registry.example.com:443/example/web_container, Get https://registry.example.com:443/v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry: read tcp 123.231.123.45:443: i/o timeout
[55949c70] -job pull(registry.example.com:443/example/web_container, version_123) = ERR (1)
[info] POST /v1.15/images/create?fromImage=registry.example.com%3A443%2Fexample%2Fweb_container%3Aversion_123
[55949c70] +job pull(registry.example.com:443/example/web_container, version_123)
[55949c70] -job execStart(2089a2b3bc9d72b6e6913ab075fa27fe1d79669fbb273eb423f03c84f44288d1) = OK (0)
2015/04/09 17:32:38 http: response.WriteHeader on hijacked connection
[info] GET /v1.15/containers/example_web_1/json
[55949c70] +job container_inspect(example_web_1)
[55949c70] -job container_inspect(example_web_1) = OK (0)
[error] pull.go:175 Get https://registry.example.com:443/v1/repositories/example/web_container/tags: read tcp 12.34.56.78:443: i/o timeout
Get https://registry.example.com:443/v1/repositories/example/web_container/tags: read tcp 12.34.56.78:443: i/o timeout
[55949c70] -job pull(registry.example.com:443/example/web_container, version_123) = ERR (1)
The single line with a timestamp is: 2015/04/09 17:32:38 http: response.WriteHeader on hijacked connection.
The nginx-proxy logs show the same requests coming in, with response code 200:
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/_ping HTTP/1.1" 200 1603 "-" "Go 1.1 package http"
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/repositories/example/web_container/tags HTTP/1.1" 200 27959 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/repositories/example/web_container/tags HTTP/1.1" 200 27959 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 200 2176 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:36:39 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 200 2176 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
I believe this is still an issue somewhere in the docker-registry gunicorn app.
The setup is (all IPs, layer ids, and containers anonymized to protect the innocent):
docker daemon (12.34.56.78) --> nginx-proxy (Host IP: 123.231.123.45 container IP: 172.17.0.12) --> docker-registry (container running on same host as nginx-proxy container)
Just noticed in my previous logs that the docker daemon (the client accessing docker-registry to pull images) has the timestamp of the error as 2015/04/09 17:32:18.
I searched the logs for that time (17:32:18) and couldn't find it, and then noticed that the clocks on the docker-daemon host and the docker-registry / nginx-proxy host are set to different time zones.
I checked the date on the host running the docker daemon and the one running the nginx-proxy and docker-registry containers, to determine the time offset and prove to myself that the clocks were synchronized correctly via NTP. This will help in lining up the timing of events. Here are the results:
"docker daemon" host: Thu Apr 9 18:57:44 MDT 2015 = 2015-04-09 18:57:44 -0600 = 2015-04-10 00:57:44 UTC
docker-registry host: Fri Apr 10 00:57:44 UTC 2015 = 2015-04-10 00:57:44 UTC
So both hosts are synchronized to the correct time, but are in different time zones: one is UTC, one is UTC-6. So we just add 6 hours to the docker daemon host time:
'2015/04/09 17:32:18' => 2015-04-09 17:32:18 -0600 => 2015-04-09 23:32:18 UTC
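That conversion is easy to sanity-check with GNU date:

$ date -u -d '2015-04-09 17:32:18 -0600'
Thu Apr  9 23:32:18 UTC 2015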
Searching the nginx-proxy logs for 23:32:18 gives 2 matches:
nginx.1 | 12.34.56.78 - - [09/Apr/2015:23:32:18 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 499 0 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64" "-"
nginx.1 | registry.example.com 12.34.56.78 - - [09/Apr/2015:23:32:18 +0000] "GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry HTTP/1.1" 499 0 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
Searching the docker-registry logs for 23:32:18 gives NO matches! Something fishy is going on...
So nginx sees the request from the docker daemon for GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry, and then logs HTTP code 499 for it.
HTTP 499 in nginx means the client closed its connection before it could hear a response. Making the assumption that nginx is telling us the truth, here is what I think is happening:
1. sudo docker pull registry.example.com:443/example/web_container:version_123 is run.
2. 2015-04-09 23:32:18 UTC: the docker daemon performs the request against nginx-proxy (GET /v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry) and eventually reports: Error pulling image (version_123) from registry.example.com:443/example/web_container, Get https://registry.example.com:443/v1/images/2b0c589adcfd8d65c643179f0cb243e9ba93c5ae80c0c1b4a5a194d6a868e659/ancestry: read tcp 123.231.123.45:443: i/o timeout
3. 2015-04-09 23:32:18 UTC: nginx-proxy logs the request & HTTP status 499.
4. 2015-04-09 23:32:38 UTC: the docker daemon writes the log line 2015/04/09 17:32:38 http: response.WriteHeader on hijacked connection (time for that line: '2015/04/09 17:32:38' => 2015-04-09 17:32:38 -0600 => 2015-04-09 23:32:38 UTC).
5. 2015-04-09 23:36:35 UTC & 2015-04-09 23:36:38 UTC: docker-registry logs the Broken pipe errors:
errors:[2015-04-09 23:36:35 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
172.17.0.12 - - [09/Apr/2015:23:36:35 +0000] "GET /v1/repositories/example/web_container/images HTTP/1.1" 200 97356 "-" "docker/1.3.3 go/go1.3.3 git-commit/d344625 kernel/2.6.32-431.el6.x86_64 os/linux arch/amd64"
[2015-04-09 23:36:38 +0000] [65] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/async.py", line 108, in handle_request
resp.write(item)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/http/wsgi.py", line 344, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 301, in write
sock.sendall(data)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/local/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
So the docker-registry throws the final Broken pipe error at 2015-04-09 23:36:38 UTC.
From the time the docker daemon makes the request (2015-04-09 23:32:18 UTC) to the time the error occurs (2015-04-09 23:36:38 UTC), the elapsed time is 0:04:20 (h:m:s), i.e. 260.00 seconds, or 4.33 minutes.
Given the symptoms and the proposed chain of events, this could potentially be related to benoitc/gunicorn#414.
The docker-registry container is running:
root@docker-registry:/# gunicorn -v
gunicorn (version 19.1.1)
And it has these python packages installed:
root@docker-registry:/# pip list
argparse (1.2.1)
backports.lzma (0.0.3)
blinker (1.3)
boto (2.34.0)
bugsnag (2.0.2)
chardet (2.0.1)
colorama (0.2.5)
docker-registry (0.9.1)
docker-registry-core (2.0.3)
Flask (0.10.1)
Flask-Cors (1.10.3)
gevent (1.0.1)
greenlet (0.4.5)
gunicorn (19.1.1)
html5lib (0.999)
itsdangerous (0.24)
Jinja2 (2.7.3)
M2Crypto (0.22.3)
MarkupSafe (0.23)
newrelic (2.22.1.20)
pip (1.5.4)
PyYAML (3.11)
redis (2.10.3)
requests (2.3.0)
setuptools (5.8)
simplejson (3.6.2)
six (1.5.2)
SQLAlchemy (0.9.4)
urllib3 (1.7.1)
WebOb (1.4)
Werkzeug (0.10.1)
wsgiref (0.1.2)
I'm able to reproduce this error easily by running the docker-registry-ui container:
boot2docker init
boot2docker up
eval $(boot2docker shellinit bash)
# start
docker run -d \
-p 80:80 \
-p 443:443 \
-v /var/run/docker.sock:/tmp/docker.sock \
-v /etc/docker/certs.d/:/etc/nginx/certs \
--name=nginx-proxy \
-t trinitronx/nginx-proxy
# start docker-registry-ui
docker run -d \
-p 8080:8080 \
-e REG1=https://registry.example.com:443/v1/ \
--name=docker-registry-ui \
atcol/docker-registry-ui
Watch the logs by opening 3 terminals:
# Terminal 1
docker logs -f nginx-proxy
# Terminal 2
docker logs -f docker-registry
# Terminal 3
docker logs -f docker-registry-ui
I then simply access the "Images" page through the docker-registry-ui Web UI, via the URL http://ip-of-boot2docker-box:8080/repository/index.
Here are all 3 anonymized logs.
The other way to reproduce it, although perhaps not as easy, is to do lots of docker pull commands in succession for different images / tagged containers.
Got another instance of this bug, with a different request, to the /repositories/*/*/images endpoint. However: same 504 Gateway Timeout, similar stacktrace, and the same Broken pipe error from gevent.
Here are the logs. There are a couple of requests in them: the first few are from the initial docker pull that failed; the next are from 2 manual cURL requests I performed, one of which reproduced the error while the other didn't. This bug is odd in that it only occurs sometimes, not always; usually a second retry is successful. Like I mentioned before, the easiest way to reproduce it is to run the registry UI and just hit some pages to search around for images.
This bug looks a lot like gevent/gevent#445, and there are other similar bugs on gevent's GitHub issue tracker. These are the ones that seem most closely related:
After enabling the SEARCH_BACKEND and GUNICORN_OPTS='[--preload]' options on the docker registry, we ran into stability issues (docker/docker-registry#540).
After disabling both of these options, the stability issues went away and the errors became much less frequent (still present, but only occasional).
I didn't have enough time to investigate further, but turning off search and disabling the --preload option did alleviate the issue.
Since this project is deprecated, hopefully this helps anyone still stuck on Registry V1.
We are using a privately hosted registry on Amazon which seems to degrade after we have pulled larger images (1GB+) from it a few times; eventually it stops responding. A restart of the registry container seems to fix the problem for a while:
And again:
We often see these messages in the log but not always:
We currently use the latest registry (registry 0.8.0 cd3581c06bdc) but have had the same problems earlier.
Registry config:
Dockerfile:
config.yml: