distribution / distribution

The toolkit to pack, ship, store, and deliver container content
https://distribution.github.io/distribution
Apache License 2.0

Meet Server Error 408 when pushing a large layer to registry v2 #755

Closed oilbeater closed 9 years ago

oilbeater commented 9 years ago

When the Docker 1.7.1 client tries to push a very large layer (1.5 GB), I get a server error message like this:

4b510ee63e03: Image already exists
a1e14256259e: Image already exists
5c305d14d4ac: Image already exists
91606cc1fef1: Buffering to Disk
91606cc1fef1: Image push failed
Error pushing to registry: Server error: 408 trying to push xobo/bjtu-xiaobo blob - sha256:415c0918210d111d42e7f534e5b9225c5ab7048e5d80667b4fedc44f2a7d6fa4

The failing layer is 1.5 GB. I used top and nload to monitor CPU and network metrics. It seems the client finishes archiving the layer, since CPU usage drops and it starts transmitting over the network. A few seconds after the network traffic drops off (I am not sure whether all the data has been transmitted), the client prints the error message above.

Every time I retry pushing this image, or any image with a layer over 1.5 GB, the same error occurs. However, pushing to registry 0.9 works. Both registries use S3 as backend storage.

I am not sure whether the problem is on the client side or the registry side.

Here is my environment info:

ubuntu@ip-172-31-9-142:~$ sudo docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64
ubuntu@ip-172-31-9-142:~$ sudo docker info
Containers: 0
Images: 2787
Storage Driver: devicemapper
 Pool Name: docker-202:80-1048577-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: extfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 90.39 GB
 Data Space Total: 107.4 GB
 Data Space Available: 16.99 GB
 Metadata Space Used: 144.2 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.003 GB
 Udev Sync Supported: false
 Deferred Removal Enabled: false
 Data loop file: /mnt/docker/devicemapper/devicemapper/data
 Metadata loop file: /mnt/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.82-git (2013-10-04)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.8-031908-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 2
Total Memory: 3.66 GiB
Name: ip-172-31-9-142
ID: 6NIM:UYQW:LB5L:SO7N:7RHR:NM34:ZQWD:UNW6:GMJV:UBI2:MBIR:3B2S
WARNING: No swap limit support
ubuntu@ip-172-31-1-2:~$ sudo docker exec registry registry -version
registry github.com/docker/distribution v2.0.1

oilbeater commented 9 years ago

I have solved this.

The root cause is that computing the checksum and writing to S3 takes more than 1 minute. No data is transferred between the registry and the engine during this time, so the load balancer, with its default 1-minute timeout, closes the connection. However, the error message "Server error: 408" is really misleading.

Feel free to close it now.
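
The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions, not how the load balancer or the registry is implemented: the function names and the timeline values are hypothetical, and the only facts taken from the thread are the 1-minute default idle timeout and the silent period of more than 1 minute.

```python
def idle_gaps(event_times):
    """Return the silent gaps (in seconds) between consecutive traffic events."""
    return [later - earlier for earlier, later in zip(event_times, event_times[1:])]


def connection_survives(event_times, idle_timeout=60.0):
    """True if no silent gap reaches the idle timeout (simplified model)."""
    return all(gap < idle_timeout for gap in idle_gaps(event_times))


# Hypothetical timeline: traffic every 10 s while the layer uploads, then
# 75 s of silence while the checksum is computed and the blob is written to S3.
upload_done = 30.0
response_sent = upload_done + 75.0
timeline = [0.0, 10.0, 20.0, upload_done, response_sent]

print(connection_survives(timeline, idle_timeout=60.0))   # False
print(connection_survives(timeline, idle_timeout=500.0))  # True
```

In this model the 75-second silent gap exceeds the 60-second timeout, so the connection is dropped before the response arrives, while a longer timeout lets the same push complete.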

dmp42 commented 9 years ago

@oilbeater I assume your load balancer is sending the 408? What do you suggest should be done to make this feel/look better? Thanks.

oilbeater commented 9 years ago

@dmp42 I use an AWS load balancer and have not investigated how it closes connections. However, since a single round trip between the registry and the engine can take a very long time when a layer is big, there should be some keepalive message during this period. Otherwise, in other situations the connection may still be closed by a router, switch, or other network device, which is even more difficult to debug.
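
The keepalive idea can be sketched in general terms as a heartbeat emitted while a long, otherwise silent step runs. This is a hypothetical illustration only: it does not show what the registry or the Docker engine actually does, and whether the push protocol could carry such a heartbeat is not addressed here. The names run_with_heartbeat and slow_checksum_and_store are invented for the example, and the durations are shortened so it runs quickly.

```python
import threading
import time


def run_with_heartbeat(task, interval, emit):
    """Run task() in a worker thread; call emit() every `interval` seconds until it finishes."""
    done = threading.Event()
    result = {}

    def worker():
        try:
            result["value"] = task()
        finally:
            done.set()  # always signal completion, even if task() raises

    thread = threading.Thread(target=worker)
    thread.start()
    # Event.wait returns False on timeout, meaning the task is still running.
    while not done.wait(interval):
        emit()
    thread.join()
    return result.get("value")  # None if task() raised


def slow_checksum_and_store():
    # Stand-in for the long, silent server-side step (hypothetical).
    time.sleep(0.35)
    return "stored"


beats = []
outcome = run_with_heartbeat(
    slow_checksum_and_store, 0.1, lambda: beats.append(time.monotonic())
)
print(outcome, len(beats))
```

In a real system emit() would have to send something the intermediate devices count as traffic; here it only records a timestamp.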

mrwacky42 commented 8 years ago

I set the ELB idle_timeout to 3600 seconds, and still get a 408 error from a 1.6.2 client pushing to registry v2.1.1 (via an nginx proxy). Nginx is logging a 499 which suggests that the client is giving up.

dmp42 commented 8 years ago

@mrwacky42 Very large layer? What happens if you push directly to the registry server (no ELB and no NGINX in the middle)?

mrwacky42 commented 8 years ago

@dmp42 - The 1.6.2 client is CircleCI. I do not have any 1.6.x docker clients otherwise available to me. But yes, a very large layer.

dmp42 commented 8 years ago

@mrwacky42 ok for docker version - still, can you try pushing directly to your registry, instead of ELB+nginx+registry?

mrwacky42 commented 8 years ago

@dmp42 Yes. In fact, with a 1.8.2 client I am able to push from an AWS instance to the ELB (and also when bypassing the ELB).

Further, I've just discovered that I did not actually change the idle_timeout on the ELB, so I'm testing again. EDIT: I tried twice; once it worked, and once CircleCI timed out due to lack of output from Docker.

rohanjoseph commented 8 years ago

@oilbeater Thank you so much for the info. I was getting the same issue with Docker. I increased the idle timeout of the Amazon load balancer from 60 seconds to 500 seconds, and that fixed it. I was really confused by the 408 error: Received HTTP code 408 while uploading layer: "". The image being pushed was big (558.9), and the push got stuck at around 558.4.
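
For anyone scripting the same change, the request for raising the idle timeout on a classic ELB might be built as follows. This is a sketch under assumptions: the load balancer name "my-registry-elb" and the helper idle_timeout_request are hypothetical, and the 500-second value is the one reported above. The block only builds the parameters; the boto3 call that would apply them is shown as a comment because it needs boto3 and AWS credentials.

```python
def idle_timeout_request(load_balancer_name, idle_timeout_seconds):
    """Build keyword arguments for setting a classic ELB's idle timeout."""
    if not isinstance(idle_timeout_seconds, int) or idle_timeout_seconds <= 0:
        raise ValueError("idle timeout must be a positive integer number of seconds")
    return {
        "LoadBalancerName": load_balancer_name,
        "LoadBalancerAttributes": {
            "ConnectionSettings": {"IdleTimeout": idle_timeout_seconds}
        },
    }


# "my-registry-elb" is a placeholder name, not taken from the thread.
params = idle_timeout_request("my-registry-elb", 500)
print(params)

# With boto3 installed and AWS credentials configured, this could be applied as:
#   import boto3
#   boto3.client("elb").modify_load_balancer_attributes(**params)
```

Raising the timeout only hides the silent period; a layer that takes longer than the new value to checksum and store would presumably fail the same way.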