Closed by oilbeater 9 years ago
I have solved this.
The root cause is that calculating the checksum and writing to S3 takes more than 1 minute. No data is transferred between the registry and the engine during this time, so the load balancer, with its default 1 minute idle timeout, closes the connection. However, the error message "Server error: 408" is really misleading.
Feel free to close it now.
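To make the root cause above concrete, here is a minimal sketch of how an idle timeout interacts with a silent gap in traffic. It is only a model: `first_idle_timeout` and the `push_timeline` numbers are hypothetical and are not taken from the registry, the engine, or the load balancer.

```python
from typing import Optional, Sequence


def first_idle_timeout(activity_times: Sequence[float], idle_timeout: float) -> Optional[float]:
    """Return the time at which an idle-timeout proxy would drop the connection.

    activity_times: sorted timestamps (seconds) at which bytes crossed the connection.
    Returns None if no gap between consecutive timestamps exceeds idle_timeout.
    """
    for previous, current in zip(activity_times, activity_times[1:]):
        if current - previous > idle_timeout:
            # The proxy gives up idle_timeout seconds after the last byte it saw.
            return previous + idle_timeout
    return None


# Hypothetical push: the upload ends at t=30s, then the registry spends
# 90s on checksum + S3 write before it sends its response at t=120s.
push_timeline = [0.0, 10.0, 20.0, 30.0, 120.0]

print(first_idle_timeout(push_timeline, idle_timeout=60.0))    # 90.0
print(first_idle_timeout(push_timeline, idle_timeout=3600.0))  # None
```

With a 60 second timeout the connection is dropped during the silent processing phase, before the registry can answer; a longer timeout lets the response through.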
@oilbeater I assume your load balancer is sending the 408? What do you suggest should be done to make this feel/look better? Thanks.
@dmp42 I use an AWS load balancer and have not investigated its close mechanism. However, since a single round trip between the registry and the engine may take a very long time when a layer is big, there should be some keepalive message during this period. Otherwise, in other setups the connection may still be closed by a router, switch, or other network device, which is even more difficult to debug.
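The keepalive idea suggested above could look roughly like the following sketch. It is a generic illustration and not how the registry or the engine is implemented: `run_with_keepalive`, `slow_checksum`, and the `beats` list are hypothetical names, and what a real keepalive message would be on the wire is left open.

```python
import threading
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_keepalive(work: Callable[[], T],
                       send_keepalive: Callable[[], None],
                       interval: float) -> T:
    """Run work() in a thread and call send_keepalive() every `interval` seconds until it ends."""
    results: List[T] = []
    errors: List[BaseException] = []

    def target() -> None:
        try:
            results.append(work())
        except BaseException as exc:  # re-raised in the calling thread below
            errors.append(exc)

    worker = threading.Thread(target=target)
    worker.start()
    while True:
        worker.join(timeout=interval)  # wait up to one interval for the work to finish
        if not worker.is_alive():
            break
        send_keepalive()  # still busy: tell the peer the connection is not dead
    if errors:
        raise errors[0]
    return results[0]


def slow_checksum() -> str:
    time.sleep(0.35)  # stand-in for a long checksum + storage write
    return "done"


beats: List[float] = []
outcome = run_with_keepalive(slow_checksum, lambda: beats.append(time.monotonic()), interval=0.1)
print(outcome, "keepalives sent:", len(beats))
```

The point of the design is that the silent phase never exceeds `interval`, so any intermediate device with an idle timeout longer than that keeps the connection open.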
I set the ELB idle_timeout to 3600 seconds, and still get a 408 error from a 1.6.2 client pushing to registry v2.1.1 (via an nginx proxy). Nginx is logging a 499 which suggests that the client is giving up.
@mrwacky42 very large layer? What happens if you push directly to the registry server (no ELB and no NGINX in the middle)?
@dmp42 - The 1.6.2 client is CircleCI. I do not have any 1.6.x docker clients otherwise available to me. But yes, a very large layer.
@mrwacky42 ok for docker version - still, can you try pushing directly to your registry, instead of ELB+nginx+registry?
@dmp42 Yes. In fact, with a 1.8.2 client I am able to push from an AWS instance both through the ELB and bypassing it.
Further, I've just discovered that I did not actually change the idle_timeout on the ELB, so I'm testing again. **EDIT**: I tried twice; once it worked, and once CircleCI timed out due to lack of output from Docker.
@oilbeater Thank you so much for the info. I was getting the same issue with Docker. I increased the idle timeout of the Amazon load balancer from 60 seconds to 500 seconds, and that fixed it. The 408 error really confused me.
Received HTTP code 408 while uploading layer: ""
The image being pushed was large (558.9), and the push got stuck at around 558.4.
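For anyone applying the same workaround from a script, here is a hedged sketch using boto3. It assumes a Classic Load Balancer (the `elb` client) and a hypothetical load balancer name `my-registry-elb`; other load balancer types use a different API, and the call needs boto3 plus valid AWS credentials.

```python
from typing import Any, Dict


def idle_timeout_request(load_balancer_name: str, seconds: int) -> Dict[str, Any]:
    """Build keyword arguments for the Classic ELB ModifyLoadBalancerAttributes call."""
    if seconds < 1:
        raise ValueError("idle timeout must be at least 1 second")
    return {
        "LoadBalancerName": load_balancer_name,
        "LoadBalancerAttributes": {"ConnectionSettings": {"IdleTimeout": seconds}},
    }


def set_idle_timeout(load_balancer_name: str, seconds: int) -> None:
    """Apply the new idle timeout; requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the helper above works without it

    client = boto3.client("elb")  # assumption: Classic Load Balancer API
    client.modify_load_balancer_attributes(
        **idle_timeout_request(load_balancer_name, seconds)
    )


# "my-registry-elb" is a hypothetical name; 500 is the value used in this comment.
request = idle_timeout_request("my-registry-elb", 500)
print(request["LoadBalancerAttributes"]["ConnectionSettings"]["IdleTimeout"])  # 500
```

Calling `set_idle_timeout("my-registry-elb", 500)` would then send the request. Raising the timeout only hides the silent processing phase; it does not shorten it.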
When the Docker 1.7.1 client tries to push a very large layer (1.5 GB), I get a server error message like this:
The failing layer is 1.5 GB. I used top and nload to monitor CPU and network usage. It seems that the client finishes archiving the layer (CPU usage drops) and starts transmitting over the network; a few seconds after the network traffic drops (I am not sure whether all data has been transmitted), the client prints the error message above.
Every time I push this image again, or any image with a layer over 1.5 GB, the same error occurs. However, pushing to registry 0.9 works. Both registries use S3 as backend storage.
I am not sure whether the problem is on the client side or the registry side.
Here is my environment info: