docker / hub-feedback

Feedback and bug reports for the Docker Hub
https://hub.docker.com

Connection issues when pulling public images (extremely slow) #2161

Closed · komodin closed this issue 2 years ago

komodin commented 2 years ago

Hi, since last week we have been randomly experiencing extremely slow speeds when pulling public images.

It happens not only when doing a docker pull from a standalone Docker instance, but also from our Kubernetes cluster.

We've tried restarting Docker and purging the DNS cache, but it did not have any effect.

It's kind of random: if I press CTRL+C and restart the download, it works fine (about 1 in 5 tries).

[screenshot attached]

All the connections are open to 104.18.124.25 (production.cloudflare.docker.com). If I do a traceroute, the hop after 149.6.39.210 varies between 188.114.108.7, 172.70.56.3, 172.70.60.2 and 172.70.58.2.

I'm attaching an MTR.

[screenshot: MTR output]

Geographic location of the system: Barcelona, Spain

IP address: AS174 (no specific IP, but the subnet is 149.7.0.0/16)

Cloud provider and region (if applicable): Cogent

komodin commented 2 years ago

Hi @binman-docker, sorry for mentioning you. I think this may have to do with https://github.com/docker/hub-feedback/issues/1675

If there is something you can do, or someone you can talk to at the CDN provider, we would really appreciate it :)

Thanks in advance

kahootali commented 2 years ago

I can second this; I am also facing quite slow image pulls. It takes 10 minutes to pull the nginx image or 15 minutes to pull the openjdk image, especially in GitHub Actions workflows.

kahootali commented 2 years ago
#8 [builder 1/5] FROM docker.io/library/gradle:jdk14@sha256:f3083db83022eb3246a53e74ba6288e78d1707fe5fc378d1295458d9a9175912
#8 extracting sha256:7d84ae7bfa87e39f74a0fa3577f9d935a0efe06a8b94619bc750f9732f374358 0.0s done
#8 extracting sha256:55515e3d0de13c21bbc1c0c5d3f21ce3f610a7c94ef83c246a1bc5851d4c36fc
#8 extracting sha256:55515e3d0de13c21bbc1c0c5d3f21ce3f610a7c94ef83c246a1bc5851d4c36fc 1.8s done
#8 extracting sha256:34dc550380e3636043dfe9ec34b8dc83447fdc4a9d788277f37f45246358ec46
#8 extracting sha256:34dc550380e3636043dfe9ec34b8dc83447fdc4a9d788277f37f45246358ec46 1.5s done
#8 DONE 1572.1s

Pulling the gradle:jdk14 image took 1572 seconds (about 26 minutes).

komodin commented 2 years ago

Hi, we're still experiencing this issue.

Is there any chance that you can contact the CDN provider to analyze it?

Thanks

ingshtrom commented 2 years ago

This one got missed as well from the old ticket: https://github.com/docker/hub-feedback/issues/1675 cc @steamboatid

I am sorry to hear that you are continuing to have issues pulling images from Docker Hub. Here are two comments that are still very useful; they list the information that would help us debug things, as well as additional tools to help with debugging: https://github.com/docker/hub-feedback/issues/1675#issuecomment-904025495 and https://github.com/docker/hub-feedback/issues/1675#issuecomment-908425896

The most important pieces are the HAR file and a Cloudflare trace ID, although the rest can be useful as well. The MTR above was interesting, but unfortunately it only shows you jumping between a few Cloudflare edge nodes.

I created a script to pull Docker images from Docker Hub with bash, tee, jq, and curl. Please give it a run and report back your full log in a gist (to avoid making these issues hard to read, because the stdout will be really long). This will automatically provide the Cloudflare ray IDs for us, in addition to some other useful timing information. Do know that the script makes several assumptions and I have only tested it on Docker Official Images. Also, it scrubs the Authorization header for you automatically; I don't think there is any other sensitive data in the docker_pull.log file it generates.

Docker pull in bash
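
For context, pulling an image by hand boils down to the standard registry v2 flow; a rough, simplified sketch of what the script automates (library/ubuntu:16.04 purely as an example, and this is not the script itself, which does more bookkeeping and logging):

# 1. get an anonymous pull token for the repository
$ TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/ubuntu:pull" | jq -r .token)

# 2. fetch the image manifest and list its layer digests
$ curl -s -H "Authorization: Bearer $TOKEN" \
       -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
       https://registry-1.docker.io/v2/library/ubuntu/manifests/16.04 | jq -r '.layers[].digest'

# 3. download one layer blob, saving the response headers (the Cf-Ray ID is in there);
#    the request is redirected to production.cloudflare.docker.com
$ curl -sL -o layer.tar.gz -D layer-headers.txt \
       -H "Authorization: Bearer $TOKEN" \
       "https://registry-1.docker.io/v2/library/ubuntu/blobs/<digest-from-step-2>"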

ingshtrom commented 2 years ago

it takes 10mins to pull nginx image or 15 mins to pull openjdk image especially in Github Actions workflows

@kahootali are you using GitHub managed runners or self-hosted GHA runners? Any public runs you can link to?

kahootali commented 2 years ago

@ingshtrom yes, we are using GitHub managed runners; the repos are private, so I can't link to the runs.

ingshtrom commented 2 years ago

I would be interested to see that "Docker pull in bash" run as well because I haven't seen these issues in GHA before.

Are they also consistent?

It is very strange to me that GitHub runners in Azure would not have an extremely fast and reliable connection to Cloudflare 🤔

ahaverty commented 2 years ago

Not sure if helpful, but adding my details on stalled/restarting docker image pull issues:

Date and time of the issue, and whether it is still happening or intermittent: All day yesterday (2021-11-04) and today (2021-11-05)

Which image - all images, one specific image? Home Assistant: https://github.com/home-assistant/core/pkgs/container/odroid-n2-homeassistant/9786309?tag=2021.11.0

OS/Version: Home Assistant OS

Geographic location of the system: Dublin, Ireland

ingshtrom commented 2 years ago

@ahaverty that is the GitHub registry you are pulling from, so I don't think it is a Docker issue. I can pull it myself using Docker Desktop on Mac and with "Docker Pull in Bash", although I'm in another part of the world 😄 Could you try pulling using Docker Pull in Bash?

ahaverty commented 2 years ago

Thanks and apologies @ingshtrom , my download eventually finished in the last hour! I appreciate the suggestions though 🙌

kahootali commented 2 years ago

I haven't seen slow pulls since then; it was only on that day, on the GitHub runners, that I reported it. The pulls look fine now, at least for the moment.

komodin commented 2 years ago

[quoting @ingshtrom's earlier comment above about the HAR file, the Cloudflare trace ID, and the "Docker pull in bash" script]

Hi @ingshtrom, We did some of the tests you suggested:

Traceroute to production.cloudflare.docker.com:

$ traceroute production.cloudflare.docker.com
traceroute to production.cloudflare.docker.com (104.18.124.25), 30 hops max, 60 byte packets
 1  149.7.xxx.xxx (149.7.xxx.xxx)  0.364 ms  0.265 ms  0.215 ms
 2  te0-3-1-8.nr51.b057442-1.bcn01.atlas.cogentco.com (149.6.131.25)  0.829 ms  0.750 ms  0.707 ms
 3  be2140.rcr21.bcn01.atlas.cogentco.com (154.25.10.45)  1.339 ms  1.281 ms  1.373 ms
 4  be2421.ccr31.vlc02.atlas.cogentco.com (154.54.61.197)  5.125 ms  5.075 ms  5.025 ms
 5  be3356.ccr32.mad05.atlas.cogentco.com (154.54.57.241)  9.634 ms  9.585 ms  9.534 ms
 6  * 149.6.39.210 (149.6.39.210)  22.809 ms  22.735 ms
 7  172.70.58.2 (172.70.58.2)  9.560 ms  10.117 ms  10.059 ms
 8  104.18.124.25 (104.18.124.25)  9.298 ms  9.252 ms  9.190 ms

Traceroute to registry-1.docker.io:

$ traceroute registry-1.docker.io
traceroute to registry-1.docker.io (52.72.232.213), 30 hops max, 60 byte packets
 1  149.7.xxx.xxx (149.7.xxx.xxx)  0.347 ms  0.284 ms  0.238 ms
 2  te0-3-1-8.nr51.b057442-1.bcn01.atlas.cogentco.com (149.6.131.25)  0.689 ms  0.646 ms  0.606 ms
 3  be2140.rcr21.bcn01.atlas.cogentco.com (154.25.10.45)  1.089 ms  1.008 ms  0.962 ms
 4  be2421.ccr31.vlc02.atlas.cogentco.com (154.54.61.197)  5.664 ms  5.620 ms  5.143 ms
 5  be3355.ccr31.mad05.atlas.cogentco.com (154.54.57.229)  9.787 ms  9.745 ms  9.704 ms
 6  be2324.ccr31.bio02.atlas.cogentco.com (154.54.61.129)  14.172 ms  14.237 ms  14.164 ms
 7  be2331.ccr41.dca01.atlas.cogentco.com (154.54.85.241)  85.588 ms  85.750 ms  85.679 ms
 8  38.140.146.178 (38.140.146.178)  97.370 ms  97.334 ms  97.291 ms
 9  * * *
10  * * *
11  52.93.28.144 (52.93.28.144)  94.957 ms 52.93.29.202 (52.93.29.202)  94.917 ms  110.655 ms
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *

cURL to https://production.cloudflare.docker.com/cdn-cgi/trace:

$ curl 'https://production.cloudflare.docker.com/cdn-cgi/trace'
fl=367f40
h=production.cloudflare.docker.com
ip=149.7.xxx.xxx
ts=1636375591.24
visit_scheme=https
uag=curl/7.74.0
colo=MAD
http=http/2
loc=ES
tls=TLSv1.2
sni=plaintext
warp=off
gateway=off

Tests using speed-cloudflare-cli. Sometimes it gets through the 100kB test, but it almost always gets stuck:

$ time npx speed-cloudflare-cli
 Server location: Madrid (MAD)
         Your IP: 149.7.xxx.xxx (ES)
         Latency: 88.77 ms
          Jitter: 82.00 ms
^C

real    6m13.787s
user    0m1.616s
sys     0m0.286s

$ time npx speed-cloudflare-cli
 Server location: Madrid (MAD)
         Your IP: 149.7.xxx.xxx (ES)
         Latency: 89.09 ms
          Jitter: 15.52 ms
^C

real    1m50.024s
user    0m1.662s
sys     0m0.323s

$ time npx speed-cloudflare-cli
 Server location: Madrid (MAD)
         Your IP: 149.7.xxx.xxx (ES)
         Latency: 75.58 ms
          Jitter: 13.25 ms
^C

real    1m54.448s
user    0m1.397s
sys     0m0.234s

And here I'm attaching two docker_pull.sh tests. One that goes really slow:

docker_pull.log

And after some re-runs, one that goes fine:

docker_pull-ok.log

Please let me know if I can share some other useful information.

Thanks!

ingshtrom commented 2 years ago

@komodin I think we are getting closer. It seems you do not have a stable connection to Cloudflare, and that is how ALL Docker image layers are downloaded for you. So somewhere between your client and the Cloudflare MAD (Madrid, Spain) edge datacenter there is a throttle that goes on and off, unless it is Cloudflare itself doing something weird.

Here is an example of me running it.

🚀 time npx speed-cloudflare-cli
 Server location: Chicago (ORD)
         Your IP: 73.75.245.202 (US)
         Latency: 37.27 ms
          Jitter: 1.17 ms
     100kB speed: 268.19 Mbps
       1MB speed: 142.70 Mbps
      10MB speed: 288.63 Mbps
      25MB speed: 338.17 Mbps
     100MB speed: 341.09 Mbps
  Download speed: 335.35 Mbps
    Upload speed: 28.94 Mbps
npx speed-cloudflare-cli  1.74s user 0.69s system 11% cpu 20.620 total

It looks like you are using Cogent as your ISP for wherever you are running this from... I wonder if anyone else is using Cogent?

komodin commented 2 years ago

@ingshtrom At first I thought the same, but after some research (and lots of traceroutes), and keeping in mind that after some retries it works fine, I saw that the first Cloudflare hop after leaving Cogent varies between these four IPs: 172.70.56.3, 172.70.58.2, 172.70.60.2 and 188.114.108.7.

If I run ping -c 5000 against each of the IPs at the same time, I see that 172.70.56.3 loses 5.48% of packets (and it can reach 10% in some tests with a lower packet count):

[screenshots: ping statistics for each of the four first-hop IPs]
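
For anyone who wants to reproduce the comparison, something along these lines should do it (a simplified sketch; the IPs come from the traceroutes above, and the loss figures will vary between runs):

# run 5000 pings against each candidate first-hop IP at the same time
$ for ip in 172.70.56.3 172.70.58.2 172.70.60.2 188.114.108.7; do
    ping -q -c 5000 "$ip" > "ping-$ip.log" 2>&1 &
  done; wait

# print the summary (packet loss) line from each log
$ grep -H "packet loss" ping-*.log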

I'm not saying that it isn't Cogent's fault, but I think it's worth also letting Cloudflare know about this, and they may do some investigation on their side.

Thanks!

ingshtrom commented 2 years ago

@komodin I will open another ticket with them. We have before and it wasn't a productive conversation. I think you have done a wonderful job gathering data, though, so maybe this will help them investigate better.

I will update in a couple days when we hear more from them.

komodin commented 2 years ago

@ingshtrom thanks! please, let me know if I can do any additional tests :)

ingshtrom commented 2 years ago

@komodin sorry, I didn't act quickly enough. Can you reproduce a good and a bad run (for comparison)? The Cloudflare Ray ID logs have expired, and we are not able to reproduce this internally at Docker.

If you do have trouble reproducing, you can try running ./docker_pull.sh nvidia/cuda 11.4.2-cudnn8-runtime-ubuntu20.04 registry-1.docker.io or ./docker_pull.sh library/ubuntu 16.04 registry-1.docker.io. The nvidia image is especially interesting because it includes a 1.3GB file, which we have seen cause issues for users.

komodin commented 2 years ago

@ingshtrom it's okay, I've run it a few times; when it goes well, it takes a few seconds:

docker_pull-ok-2.log

But when it got stuck, it took like 10 minutes to finish:

docker_pull-err.log

ingshtrom commented 2 years ago

Beautiful, thank you! I will get this to Cloudflare ASAP

ingshtrom commented 2 years ago

@komodin are you able to run an MTR showing the packet loss? They want to find out where exactly the packet loss is happening in order to debug this further, because none of the metrics they have readily available look bad.
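
A report-mode run is usually the easiest thing to paste, for example something like this (standard mtr options; the target is the blob host from earlier in the thread):

$ mtr --report --report-wide --show-ips --aslookup -c 100 production.cloudflare.docker.com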

komodin commented 2 years ago

@ingshtrom I'm attaching the MTR. Thanks for following up on the issue!

Screen Shot 2021-11-13 at 03 07 47

ingshtrom commented 2 years ago

@komodin I know that it was still happening 4 days ago, but can you confirm that this is still an ongoing issue today? The issue is being escalated further inside Cloudflare, and they would like confirmation that it is still happening.

Thank you!

komodin commented 2 years ago

@ingshtrom hi, thanks for the follow-up! Yes, it's still happening. I'm attaching a screenshot of what happens when it fails. If I retry, it works fine after one or two tries.

Screen Shot 2021-11-17 at 16 59 56

Is there any way to dump the request/response headers using docker pull? I thought of it as a way to give you more information about the failed requests.

Thanks

ingshtrom commented 2 years ago

@komodin Unfortunately, no. You can run something like mitmproxy to catch all of this traffic, but it can be a pain to set up the first time so that you can inspect the data coming through, not to mention that you wouldn't want to do that in production environments.

I will pass along that suggestion to our internal teams.
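
If you do want to try it against the pull script rather than the daemon, a rough sketch would be something like the following (ports and paths are mitmproxy defaults, har_dump.py is mitmproxy's example HAR addon, and this assumes the script honors curl's standard proxy/CA environment variables):

# start mitmdump with the HAR export addon; it writes the HAR file on shutdown
$ mitmdump --listen-host 127.0.0.1 --listen-port 8080 -s ./har_dump.py --set hardump=./dump.har

# in another shell, route the curl-based pull through the proxy and trust mitmproxy's CA
$ export https_proxy=http://127.0.0.1:8080
$ export CURL_CA_BUNDLE=$HOME/.mitmproxy/mitmproxy-ca-cert.pem
$ ./docker_pull.sh library/ubuntu 16.04 registry-1.docker.io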

ingshtrom commented 2 years ago

@komodin Cloudflare cannot reproduce the issue from their edge colos. They have turned on increased verbosity in their logs. Can you attempt to reproduce it again and note the exact timestamps of when it happened? (They should be in the log file if you continue to use that script.)

FWIW I was able to reproduce this once against an edge colo in the United States in our staging environment, but it was really hit-or-miss.

komodin commented 2 years ago

@ingshtrom hi! Thanks for the update! I'm attaching a mitmproxy dump (I did it using mitmdump --listen-host 127.0.0.1 -s ./har_dump.py --set hardump=./dump/dump.har, I hope it's okay). I let it run for more than 40 minutes:

Screen Shot 2021-11-18 at 21 14 41

dump_second_request_failed_42min.har.zip

Tomorrow I'll ask one of my teammates to run the script again.

I hope this helps! Thanks again.

komodin commented 2 years ago

Oh, I just saw that the last request (the one that gets stuck) is not logged. I think mitmproxy only logs it when the connection is closed. I'll try to keep the capture running until it fails. Sorry!

komodin commented 2 years ago

Well, running mitmproxy (not mitmdump) I can see requests even while they are still active, and this is what I see when it gets stuck:

Screen Shot 2021-11-18 at 21 25 41

And this one went ok:

Screen Shot 2021-11-18 at 21 48 04

Now we also ran your script again:
failed: docker_pull-err-2.log
ok: docker_pull-ok-3.log

ingshtrom commented 2 years ago

Thank you! You have been beyond helpful in nailing this down, and I appreciate your patience. I have forwarded all of this data and hopefully we will hear back soon.

ingshtrom commented 2 years ago

@komodin The Cloudflare engineers are attempting to reproduce this using the script I gave you. In addition, they are asking for a packet capture during one of the failed runs. I have tested this, and I believe running the following command while running the docker_pull.sh script will give us the data we need:

$ sudo tcpdump host registry-1.docker.io or production.cloudflare.docker.com or auth.docker.io
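
If it is easier to share, the same capture can be written to a file instead of the terminal (standard tcpdump flags; attach the resulting .pcap):

$ sudo tcpdump -s 0 -w docker_pull.pcap host registry-1.docker.io or production.cloudflare.docker.com or auth.docker.io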

I have documented this in the Docker Pull in Bash repository.

Thank you!

komodin commented 2 years ago

Hi @ingshtrom, sorry for the slow response; these have been tough days at work. We ran the script again until it failed and took the tcpdump capture.

Let me know if you need something else!

docker_pull_registry-1.docker.io._library-ubuntu_16.04.log
tcpdump.log

Thanks!

komodin commented 2 years ago

Hi @ingshtrom, just as a friendly follow-up, have you had any update from Cloudflare? Is there anything else we can do to help you?

Thanks! :)

ingshtrom commented 2 years ago

What great timing! They responded last night that they believe they have identified the issue.

Hello Docker team,

Thanks for your patience w/ this issue. Our internal team tracked this down to a problem w/ response stream batching. We have work in progress scheduled for Q1 to refactor this and hopefully improve inefficiencies going forward.

In the meantime we do want to track down any remaining complaints and make sure that any remaining issues are inline w/ our current issues.

Please confirm the current state of the issue and any recent complaints that you've fielded so we can review accordingly.

Can you confirm that you are still seeing the issue?

I will also be reaching out to Cloudflare to see if there are any mitigations we can do in the meantime.

Electronlibre2012 commented 2 years ago

Hello,

I still got an extremely slow connection when pulling the latest HA Core 2021.12.3 tonight.

Hope they can find a solution ;)

ingshtrom commented 2 years ago

@Electronlibre2012 if you can easily reproduce it with Docker Pull in Bash, that would help us know whether it is similar to the issue above. Thank you!

It is hard to tell, but there are some tell-tale signs, like: (1) the headers say it was a Cloudflare cache hit, (2) it works perfectly fine at other times, (3) other layer downloads in the same image download quickly, (4) you are located in Eastern Europe, Asia, or Australia (we can get this from the client IP if you provide it as well, or you can just tell us).
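
If you do run the script, the quickest check for sign (1) is to grep the headers it records; assuming your docker_pull.log keeps the raw response headers, something like:

$ grep -iE 'cf-ray|cf-cache-status' docker_pull.log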

Electronlibre2012 commented 2 years ago

I am located on Réunion Island (between South Africa and Mauritius).

I tried restarting the update after a host reboot and it updated within a minute... lol

Thanks for your help

komodin commented 2 years ago

[quoting @ingshtrom's comment above, including Cloudflare's response about response stream batching and the question about whether the issue is still occurring]

Hey! Happy new year!

Yes, it was still happening at the time of your comment (sorry, I thought I had replied), and it is still happening now. It's good to hear that they found the root cause; thanks for keeping me updated during the investigation :)

ingshtrom commented 2 years ago

I'm sorry to ask for more information, but Cloudflare is requesting a packet capture again, as described here. I'm asking them if there is anything else they need to move this forward.

ingshtrom commented 2 years ago

@komodin are you still seeing this issue? I just heard back that the fix was deployed "a while back". I'm not sure when, but Cloudflare has confirmed that a fix has gone out for what they think is the root cause.

komodin commented 2 years ago

Hey @ingshtrom! We are no longer experiencing the issue; it has been working fine since your last comment (January). Sorry for the radio silence.

We (as @immfly) want to thank you for all the effort and for the investigation and follow-up on the issue with Cloudflare.

ingshtrom commented 2 years ago

What an adventure! Thank you so much for all of the feedback, patience, and debug logs!