docker-library / docker

Docker Official Image packaging for Docker
Apache License 2.0

Requests hang when pulling from github.com #471

Closed Hamxter closed 5 months ago

Hamxter commented 6 months ago

I've encountered a unique problem using DinD that I haven't been able to find a solution to. I'm running DinD inside a microk8s cluster to run DevOps pipelines. The problem is that containers running inside DinD cannot pull ANY content from github.com (and only github.com, as far as I can tell); the request just hangs after resolving DNS and connecting.

Here is a sample of a failing request from a container running inside DinD. Note that this is not isolated to the tooling or the repository: I've tried cURL and Node to make the request, and I can't even get a response from wget github.com. I've also tried multiple different containers.

/ # docker run -it node:20 sh
# wget https://github.com/helmfile/helmfile/releases/download/v0.158.1/helmfile_0.158.1_linux_amd64.tar.gz
--2023-12-31 23:46:29--  https://github.com/helmfile/helmfile/releases/download/v0.158.1/helmfile_0.158.1_linux_amd64.tar.gz
Resolving github.com (github.com)... 20.248.137.48
Connecting to github.com (github.com)|20.248.137.48|:443... connected.

It just hangs after this point.

However, if I run the same request after exec'ing into the DinD container itself (not into a container running inside it), it works fine.
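(By "exec'ing into DinD" I mean opening a shell directly in the DinD pod rather than in a nested container, roughly like this, with the pod name as a placeholder:)

kubectl exec -it <dind-pod> -- sh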

/ # wget https://github.com/helmfile/helmfile/releases/download/v0.158.1/helmfile_0.158.1_linux_amd64.tar.gz
Connecting to github.com (20.248.137.48:443)
Connecting to objects.githubusercontent.com (185.199.109.133:443)
saving to 'helmfile_0.158.1_linux_amd64.tar.gz'
helmfile_0.158.1_lin 100% |********************************| 20.3M  0:00:00 ETA
'helmfile_0.158.1_linux_amd64.tar.gz' saved

My DinD deployment is simple:

image:
  repository: docker
  tag: 24-dind
  pullPolicy: IfNotPresent
env:
  DOCKER_TLS_CERTDIR: /certs
securityContext:
  privileged: true

Here are my nodes:

NAME      STATUS                     ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
rachel    Ready                      <none>   415d   v1.28.3   192.168.1.9    <none>        Ubuntu 22.04.3 LTS   5.15.0-91-generic   containerd://1.6.15
roy       Ready                      <none>   415d   v1.28.3   192.168.1.10   <none>        Ubuntu 22.04.3 LTS   5.15.0-91-generic   containerd://1.6.15
deckard   Ready,SchedulingDisabled   <none>   415d   v1.28.3   192.168.1.8    <none>        Ubuntu 22.04.3 LTS   5.15.0-91-generic   containerd://1.6.15

I have tried multiple different versions of DinD and couldn't get it to work. I tried replicating this with Docker on my desktop (docker -> dind -> node:20) and it worked fine; a rough sketch of that local test is below. I'm not sure what else to try here, so any help would be greatly appreciated. Thanks
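(Roughly what the desktop comparison looked like, reusing the image tag and URL from above; this is a sketch, not an exact transcript:)

docker run -d --privileged --name dind docker:24-dind
docker exec -it dind sh
# inside the DinD container:
docker run -it --rm node:20 sh
# inside node:20:
wget https://github.com/helmfile/helmfile/releases/download/v0.158.1/helmfile_0.158.1_linux_amd64.tar.gz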

tianon commented 6 months ago

Hmm, is this a problem you only started seeing recently? It's really unlikely, but could possibly be related to #466 / #467 / #468 (unlikely because you're on Ubuntu 22.04, which shouldn't have issues with either iptables or nftables :see_no_evil:)

Hamxter commented 6 months ago

This is the first time I have deployed DinD, so I don't have any previous data. I believe I have found the cause of the issue but am unsure how to fix it. I used ksniff to record the packet data of the DinD container during a request.
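(For reference, the capture was taken with the ksniff kubectl plugin, along these lines; the pod name and namespace are placeholders:)

kubectl sniff <dind-pod> -n <namespace> -f "port 443" -o dind-capture.pcap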

(Wireshark capture screenshot: Wireshark_T35pDSNZLp)

I think the main thing to look at is the duplicate ClientHello requests: one coming from the node:20 container within DinD (which is expected), but also one coming from the DinD container itself (identified by the Kubernetes container IP). I made this request against other sites and can confirm that ALL activity is duplicated (ranging from DNS lookups to application data packets, there are always two of everything).

There is another problem: the protocol used in this ClientHello is TLSv1. All requests I made to other sites used TLSv1.3, coming from the same container.
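(One way to compare negotiated TLS versions from inside the same nested container, for sites that do respond, is something like the following; it assumes openssl is available in the image and <other-site> is a placeholder:)

openssl s_client -connect <other-site>:443 -servername <other-site> </dev/null 2>/dev/null | grep -i protocol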

The only conclusion I can come to is that GitHub is ignoring the ClientHello, either because duplicate requests arrive within such a short period of time, or because TLSv1 is being used.

I believe the issue I'm primarily trying to tackle here is the duplicate requests. I'm unsure where to go from here to debug this, so any help would be greatly appreciated.

Finally, here is a request that is similar to the GitHub content request, but to a different site, where it completes successfully. Note all of the TCP duplicates.

(Wireshark capture screenshot: Wireshark_p8u8s4VqAj)

tianon commented 5 months ago

Now that #468 is merged and deployed, can you try again? (if it still doesn't work, try with DOCKER_IPTABLES_LEGACY=1 set :eyes:)
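(In the values format from the original report, that would presumably look something like this, keeping the existing cert dir setting:)

env:
  DOCKER_TLS_CERTDIR: /certs
  DOCKER_IPTABLES_LEGACY: "1"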

Hamxter commented 5 months ago

This did not fix the issue. It would be interesting if someone else could run ksniff on their DinD container to see whether network calls are duplicated like they are on my cluster. It might be a lower-level issue (I'm running microk8s).

Hamxter commented 5 months ago

I figured out the problem. I use Project Calico as my CNI, which uses an MTU of 1440, while DinD defaults to an MTU of 1500. This was discussed in https://github.com/projectcalico/calico/issues/2334. If anyone comes across this in the future, the fix was simply to add args: ["--mtu=1440"] to the DinD deployment.
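For anyone applying the same fix, in the values format shown earlier it presumably ends up looking roughly like this (exactly where args lands depends on how your chart maps values onto the container spec; the --mtu flag just needs to reach dockerd):

image:
  repository: docker
  tag: 24-dind
  pullPolicy: IfNotPresent
args: ["--mtu=1440"]
env:
  DOCKER_TLS_CERTDIR: /certs
securityContext:
  privileged: true

You can check what MTU the CNI actually gives the pod (and therefore what value to pass to dockerd) with something like kubectl exec -it <dind-pod> -- cat /sys/class/net/eth0/mtu, assuming the pod interface is eth0.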