containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Image quay.io/podman/stable:v4.5 very often fails to pull from remote registries #18558

Closed: Romain-Geissler-1A closed this issue 11 months ago

Romain-Geissler-1A commented 1 year ago

Issue Description

Hi,

My builds using "podman in docker/podman" scenarios started to fail recently, blocked on `podman pull` commands that hang forever. It seems very specific to podman 4.5, as I had no issue with podman 4.4 before. It also really feels like there is some kind of race condition in the lower parts of the networking stack: it rarely works, and the outcome depends on the registry I am trying to reach. In my case, pulling from quay.io almost always fails with podman 4.5 while it always worked with 4.4. However, pulling from registry.fedoraproject.org or our private Amadeus Artifactory registry always works, no matter the podman version.

I could reproduce this issue with different hosts and container technologies; all these machines are located on the Amadeus premises and use the same network: RHEL 7 x86_64 kernels running docker 20.10, RHEL 8 aarch64 kernels running docker 23 and podman 4.2 (as the first container layer), and RHEL 9 x86_64 kernels running podman 4.2 (as the first container layer).

Cheers, Romain

Steps to reproduce the issue

  1. Try to pull a Fedora image from quay.io using quay.io/podman/stable:v4.4; it always works fine:
rgeissler@ncerndobedev6097:~> podman run --rm --privileged quay.io/podman/stable:v4.4 podman pull quay.io/fedora/fedora                                                                                        1:00PM
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
  2. Try the exact same thing with quay.io/podman/stable:v4.5; most of the time it times out, and only very rarely does it work:
rgeissler@ncerndobedev6097:~> podman run --rm --privileged quay.io/podman/stable:v4.5 podman pull quay.io/fedora/fedora                                                                                        1:12PM
Trying to pull quay.io/fedora/fedora:latest...
^C
  3. The same command with more detailed logs from the nested podman:
rgeissler@ncerndobedev6097:~> podman run --rm --privileged quay.io/podman/stable:v4.5 podman --log-level=trace pull quay.io/fedora/fedora                                                                      1:13PM
time="2023-05-13T13:13:28Z" level=info msg="podman filtering at log level trace"
time="2023-05-13T13:13:28Z" level=debug msg="Called pull.PersistentPreRunE(podman --log-level=trace pull quay.io/fedora/fedora)"
   ... (output removed)
time="2023-05-13T13:13:28Z" level=debug msg="Looking for TLS certificates and private keys in /etc/docker/certs.d/quay.io"
time="2023-05-13T13:13:28Z" level=debug msg="GET https://quay.io/v2/"
time="2023-05-13T13:13:28Z" level=debug msg="Ping https://quay.io/v2/ status 401"
time="2023-05-13T13:13:28Z" level=debug msg="GET https://quay.io/v2/auth?scope=repository%3Afedora%2Ffedora%3Apull&service=quay.io"
time="2023-05-13T13:13:29Z" level=debug msg="Increasing token expiration to: 60 seconds"
time="2023-05-13T13:13:29Z" level=debug msg="GET https://quay.io/v2/fedora/fedora/manifests/latest"
time="2023-05-13T13:13:29Z" level=debug msg="Content-Type from manifest GET is \"application/vnd.docker.distribution.manifest.list.v2+json\""
time="2023-05-13T13:13:29Z" level=debug msg="Using blob info cache at /var/lib/containers/cache/blob-info-cache-v1.boltdb"
time="2023-05-13T13:13:29Z" level=debug msg="Source is a manifest list; copying (only) instance sha256:158d1c036343866773afa7a7d0412f1479f2f78b881d59528ce61fd29a11e95f for current system"
time="2023-05-13T13:13:29Z" level=debug msg="GET https://quay.io/v2/fedora/fedora/manifests/sha256:158d1c036343866773afa7a7d0412f1479f2f78b881d59528ce61fd29a11e95f"
^C
  4. Pulling the exact same image from another public registry always works:
rgeissler@ncerndobedev6097:~> podman run --rm --privileged quay.io/podman/stable:v4.5 podman pull registry.fedoraproject.org/fedora                                                                            1:16PM
Trying to pull registry.fedoraproject.org/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965

Describe the results you received

Timeouts most of the time (sometimes I waited tens of minutes and it miraculously "unblocked" and went ahead).

Describe the results you expected

Pulls should always work, from all registries (provided the registries themselves are up and running fine).

podman info output

See above; the first layer of container technology doesn't really matter. It happens with docker/podman, x86_64/aarch64, RHEL 7/8/9 kernels, rootless or rootful for the first container layer. And it always works with the image `quay.io/podman/stable:v4.4`.

Podman in a container

Yes

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

rhatdan commented 1 year ago

@mtrmac PTAL

mtrmac commented 1 year ago

time="2023-05-13T13:13:29Z" level=debug msg="GET https://quay.io/v2/fedora/fedora/manifests/sha256:158d1c036343866773afa7a7d0412f1479f2f78b881d59528ce61fd29a11e95f" That’s a request for a small bit of metadata… I don’t know why it would hang at the registry side.

One possibly relevant thing that did change between 4.4 and 4.5 is vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go in https://github.com/containers/podman/commit/76056c67018073c56104f8c09278a40f1e313aa2#diff-659d4f75a65c79717fcb7a4ea0553085750e14a829e295a36f5ceca4341bda36, which enabled HTTP persistent connections.

Could you possibly try patching that out to see if that’s directly connected?

Romain-Geissler-1A commented 1 year ago

Very good catch!

I tried with this Dockerfile:

FROM quay.io/podman/stable as builder

ARG PACKAGE_NAME=podman
ARG AMADEUS_PATCH_VERSION_SUFFIX=.PatchedByAmadeus.1

RUN dnf install -y dnf-plugins-core rpmdevtools

RUN ( \
    mkdir -p /workdir/source && \
    mkdir -p /workdir/output && \
    cd /workdir/source && \
    dnf download --source "${PACKAGE_NAME}" \
)

RUN dnf builddep -y /workdir/source/${PACKAGE_NAME}*.rpm

RUN rpm -ivh /workdir/source/${PACKAGE_NAME}*.rpm
ADD disable-http-keep-alive.patch /root/rpmbuild/SOURCES/disable-http-keep-alive.patch
# Append a patch after the first "SourceXXX: " line in the spec file.
RUN sed -i -n -e '1h;1!H;${g;s/^Source[[:digit:]]\+:.*$/&\nPatch9999: disable-http-keep-alive.patch/m;p}'  "${HOME}/rpmbuild/SPECS/${PACKAGE_NAME}.spec"
RUN sed -i -e "s/^Release: \(.\+\)/Release: \1${AMADEUS_PATCH_VERSION_SUFFIX}/" "${HOME}/rpmbuild/SPECS/${PACKAGE_NAME}.spec"

RUN rpmbuild -bb "${HOME}/rpmbuild/SPECS/${PACKAGE_NAME}.spec"

# Only copy some packages:
RUN cp -a ${HOME}/rpmbuild/RPMS/$(uname -m)/podman-4.5.0-*${AMADEUS_PATCH_VERSION_SUFFIX}.*.rpm /workdir/output

FROM quay.io/podman/stable

COPY --from=builder /workdir/output /workdir/output

RUN dnf install -y /workdir/output/*.rpm

and locally the file disable-http-keep-alive.patch with this content:

diff --git a/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go b/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
index 56b0d4939..b6f73b799 100644
--- a/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
+++ b/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
@@ -96,8 +96,7 @@ func NewTransport() *http.Transport {
                Proxy:               http.ProxyFromEnvironment,
                DialContext:         direct.DialContext,
                TLSHandshakeTimeout: 10 * time.Second,
-               IdleConnTimeout:     90 * time.Second,
-               MaxIdleConns:        100,
+               DisableKeepAlives: true,
        }
        return tr
 }

I built the same Dockerfile with and without the patch (i.e. by commenting out the line adding Patch9999). The stock podman 4.5 is in image 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 and the patched podman 4.5 (with HTTP keep-alive disabled) is image fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d. Running some tests, I indeed see that it fails with the stock podman 4.5 and works with the patched one:

rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
^C
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
^C
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
^C
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965

So indeed there seems to be something wrong with the code dealing with reusing HTTP connections.
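To help check whether this is specific to podman or reproducible with plain Go on the same network, here is a minimal sketch of a standalone client that issues a few consecutive requests to quay.io through one client and reports whether the underlying connection is reused. The transport settings mirror the 4.5 defaults quoted in the patch above; the endpoint, iteration count, and the simplified anonymous /v2/ ping are illustrative and may not be enough to trigger the exact hang seen with manifest requests.

// keepalive-repro.go: send consecutive requests to quay.io over one client
// and report whether the TCP/TLS connection gets reused, to see if the hang
// reproduces outside podman. Transport values mirror the 4.5 patch context.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	tr := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		TLSHandshakeTimeout: 10 * time.Second,
		IdleConnTimeout:     90 * time.Second,
		MaxIdleConns:        100,
		// DisableKeepAlives: true, // uncomment to mimic the patched build
	}
	client := &http.Client{Transport: tr, Timeout: 30 * time.Second}

	for i := 1; i <= 3; i++ {
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				fmt.Printf("request %d: connection reused: %v\n", i, info.Reused)
			},
		}
		req, err := http.NewRequest("GET", "https://quay.io/v2/", nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("request %d failed: %v\n", i, err)
			return
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		fmt.Printf("request %d: status %s\n", i, resp.Status)
	}
}

If such a program stalls on the second or third request on the affected network but completes with DisableKeepAlives uncommented, the problem would sit below podman, in how the network path handles reused connections.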

mtrmac commented 1 year ago

Thanks! That’s an important data point; further debugging would be harder. Client-side, I’d say we’d need to dump the TLS secrets (KeyLogWriter, random Google finds an https://github.com/joneskoo/http2-keylog/blob/b6e4051a241b2c96bfbb8dca8cd7c321fcfe436c/h2keylog-client/client.go#LL62C32-L62C44 example), and then see what is going on using Wireshark.

But to my knowledge we haven’t had widespread reports of quay.io breaking, and I think we would have heard.

Is there a corporate proxy involved, by any chance?

Romain-Geissler-1A commented 1 year ago

Well, this is done from a machine located in a big company, big enough that I don't know the details of our network infrastructure. I am not aware of an actual HTTP proxy (and my environment doesn't define any "http(s)_proxy" variable), but you can certainly expect all the usual "security" gear you find in corporations (if there is a proxy, it's a transparent one I am not aware of).

So would that mean we have the very same issue with any Go program which doesn't disable HTTP keep-alives? It's strange, I have never faced this so far. And it works with other registries because those would not take HTTP keep-alives into account?

mtrmac commented 1 year ago

if there is a proxy, it's a transparent one which I am not aware of

For that proxy to directly interfere with Connection: Keep-Alive, it would need to break TLS; so the client would need to have a corporate CA, or something like that, installed, even with a notionally-transparent proxy.

I suppose it’s possible that the proxy/firewall is interfering with just the sequence of packets without understanding the details…

It might be reasonably easy (or, maybe, impossible) to confirm whether something like this is involved by running the same commands with the same software from an outside network.


would that mean we have the very same issue with any Go program which doesn't disable HTTP keep-alives? … And it works with other registries …

I don’t know where to even start guessing about what exactly that proxy/firewall, if it indeed is a proxy/firewall and not a straightforward Podman bug, doesn’t like. (Purely hypothetically, the first request might trigger a content-specific validator based on Content-Type or something like that, and that validator might be buggy about continuing to process the following requests on a Connection: Keep-Alive socket, in some situations.)

I would certainly not rule out a straightforward Podman bug at this point — or it might be a combination of an interfering proxy/firewall and a Podman bug.

Romain-Geissler-1A commented 1 year ago

For that proxy to directly interfere with Connection: Keep-Alive, it would need to break TLS; so the client should need to have a corporate CA, or something like that, installed, even with a notionally-transparent proxy.

We have such a "deep packet inspection" "security" tool installed on our laptops, and it actually creates lots of issues as it breaks all TLS flows... But the machine from which I run all these tests is not a corporate laptop and doesn't suffer from any TLS man-in-the-middle issue (no corporate CA is needed for web traffic).

It might be reasonably easy (or, maybe, impossible) to confirm whether something like this is involved by running the same commands with the same software from an outside network.

Mmmh, for this I would need to ask some colleagues to test, as the only Linux machines I have easy access to are inside the corporate network.

So I guess I have to investigate further on my side and add traces everywhere to understand what's going on.

mtrmac commented 1 year ago

I think it’s very fair to say that the pull infrastructure should expose options to get this kind of trace without having to recompile & patch :) Part of that is productizing https://github.com/containers/image/pull/201, but I don’t think we’ve ever had KeyLogWriter exposed.
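Purely as an illustration of what exposing that might look like (nothing like this exists in containers/image today; the variable name and helper below are hypothetical), the transport setup could honor an SSLKEYLOGFILE-style environment variable:

// Hypothetical sketch only: neither this helper nor the environment variable
// exist in containers/image; they illustrate how key logging could be enabled
// at runtime without recompiling.
package tlsdebug

import (
	"crypto/tls"
	"os"
)

// maybeEnableKeyLog attaches a key log writer to cfg when the hypothetical
// CONTAINERS_SSLKEYLOGFILE variable points to a writable file.
func maybeEnableKeyLog(cfg *tls.Config) error {
	path := os.Getenv("CONTAINERS_SSLKEYLOGFILE")
	if path == "" {
		return nil // key logging stays disabled by default
	}
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	cfg.KeyLogWriter = f
	return nil
}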

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

Romain-Geissler-1A commented 1 year ago

Hi,

I still have the issue, but indeed it seems related to some special configuration inside Amadeus (I don't know whether the problem is on our side or on the podman/Go side, though). I still haven't had time to investigate this issue more in depth. I tried last Friday with the latest podman upstream image and the problem is still present for us. As I guess you will ship podman 4.6 in the next RHEL point update, I will have to find a solution anyway, as it will start to block more builds on our side in the future.

Cheers, Romain

vrothberg commented 1 year ago

@Romain-Geissler-1A are you still seeing the issue?

Romain-Geissler-1A commented 11 months ago

Hi,

I tried again just now, and I can't reproduce it anymore. Note that in the meantime the machine I use to test this was upgraded to RHEL 9.3/podman 4.6. However, I think this was more related to some internal Amadeus network issue, which may have been resolved since then (I have no visibility into recent internal network changes).

So closing this ticket.

Thanks, Romain

vrothberg commented 11 months ago

Thanks, @Romain-Geissler-1A !