@mtrmac PTAL
time="2023-05-13T13:13:29Z" level=debug msg="GET https://quay.io/v2/fedora/fedora/manifests/sha256:158d1c036343866773afa7a7d0412f1479f2f78b881d59528ce61fd29a11e95f"
That’s a request for a small bit of metadata… I don’t know why it would hang at the registry side.
One possibly relevant thing that did change between 4.4 and 4.5 is vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
in https://github.com/containers/podman/commit/76056c67018073c56104f8c09278a40f1e313aa2#diff-659d4f75a65c79717fcb7a4ea0553085750e14a829e295a36f5ceca4341bda36 , enabling HTTP persistent connections.
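For readers not following the link: roughly, the change swaps a transport that disabled keep-alives for one that pools idle connections. A minimal before/after sketch, assuming the field values from the diff quoted further down this thread (the surrounding dialer setup is omitted):
package tlssketch

import (
	"net/http"
	"time"
)

// Pre-4.5 behaviour (sketch): keep-alives disabled, so every registry
// request opens a fresh TCP+TLS connection.
func newTransport44() *http.Transport {
	return &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		TLSHandshakeTimeout: 10 * time.Second,
		DisableKeepAlives:   true,
	}
}

// 4.5 behaviour (sketch): persistent connections allowed, with up to
// 100 idle connections kept for 90 seconds and reused for later requests.
func newTransport45() *http.Transport {
	return &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		TLSHandshakeTimeout: 10 * time.Second,
		IdleConnTimeout:     90 * time.Second,
		MaxIdleConns:        100,
	}
}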
Could you possibly try patching that out to see if that’s directly connected?
Very good catch!
I tried with this Dockerfile:
FROM quay.io/podman/stable as builder
ARG PACKAGE_NAME=podman
ARG AMADEUS_PATCH_VERSION_SUFFIX=.PatchedByAmadeus.1
RUN dnf install -y dnf-plugins-core rpmdevtools
RUN ( \
mkdir -p /workdir/source && \
mkdir -p /workdir/output && \
cd /workdir/source && \
dnf download --source "${PACKAGE_NAME}" \
)
RUN dnf builddep -y /workdir/source/${PACKAGE_NAME}*.rpm
RUN rpm -ivh /workdir/source/${PACKAGE_NAME}*.rpm
ADD disable-http-keep-alive.patch /root/rpmbuild/SOURCES/disable-http-keep-alive.patch
# Append a patch after the first "SourceXXX: " line in the spec file.
RUN sed -i -n -e '1h;1!H;${g;s/^Source[[:digit:]]\+:.*$/&\nPatch9999: disable-http-keep-alive.patch/m;p}' "${HOME}/rpmbuild/SPECS/${PACKAGE_NAME}.spec"
RUN sed -i -e "s/^Release: \(.\+\)/Release: \1${AMADEUS_PATCH_VERSION_SUFFIX}/" "${HOME}/rpmbuild/SPECS/${PACKAGE_NAME}.spec"
RUN rpmbuild -bb "${HOME}/rpmbuild/SPECS/${PACKAGE_NAME}.spec"
# Only copy some packages:
RUN cp -a ${HOME}/rpmbuild/RPMS/$(uname -m)/podman-4.5.0-*${AMADEUS_PATCH_VERSION_SUFFIX}.*.rpm /workdir/output
FROM quay.io/podman/stable
COPY --from=builder /workdir/output /workdir/output
RUN dnf install -y /workdir/output/*.rpm
and, locally, the file disable-http-keep-alive.patch with this content:
diff --git a/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go b/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
index 56b0d4939..b6f73b799 100644
--- a/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
+++ b/vendor/github.com/containers/image/v5/pkg/tlsclientconfig/tlsclientconfig.go
@@ -96,8 +96,7 @@ func NewTransport() *http.Transport {
Proxy: http.ProxyFromEnvironment,
DialContext: direct.DialContext,
TLSHandshakeTimeout: 10 * time.Second,
- IdleConnTimeout: 90 * time.Second,
- MaxIdleConns: 100,
+ DisableKeepAlives: true,
}
return tr
}
I have built the same Dockerfile with and without the patch (i.e. by commenting out the line adding Patch9999). The stock podman 4.5 is in image 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 and the patched podman 4.5 (with HTTP keep-alive disabled) is in image fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d. Running some tests, I indeed see that the pull fails with the stock podman 4.5 and works with the patched one:
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
^C
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
^C
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged 5aaf6926b12e6a06e17f89ebf8445f07a00b20051c9ea2e044009d75b1cbf577 podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
^C
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
rgeissler@ncerndobedev6097:~/wk/podman (investigate-podman-v4.5-regression *)> podman run --rm --privileged fc8fe7c50f8fc768d7e1625bc4f34f12db0fb84682b118f9dd4baf3e7394620d podman pull quay.io/fedora/fedora
Trying to pull quay.io/fedora/fedora:latest...
Getting image source signatures
Copying blob sha256:de924d42b91d6d6898968a85cf956b216065442e876409dc098996c2f9f6282d
Copying config sha256:c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
Writing manifest to image destination
Storing signatures
c9bfca6d0ac2a242226fa68db587ab26836e4b3fb5458284be7505e95b12c965
So indeed there seems to be something wrong with the code dealing with reusing HTTP connections.
Thanks! That’s an important data point; further debugging would be harder. Client-side, I’d say we’d need to dump the TLS secrets (via KeyLogWriter; a quick search turns up an example at https://github.com/joneskoo/http2-keylog/blob/b6e4051a241b2c96bfbb8dca8cd7c321fcfe436c/h2keylog-client/client.go#LL62C32-L62C44 ), and then see what is going on using Wireshark.
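For reference, a minimal client-side sketch of that approach, assuming a throwaway key-log path and an illustrative URL (both are placeholders); pointing Wireshark’s TLS preferences at the key log file then lets it decrypt the captured traffic:
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Append TLS session secrets in SSLKEYLOGFILE format; Wireshark can
	// use this file (Preferences -> Protocols -> TLS) to decrypt a capture.
	keylog, err := os.OpenFile("/tmp/tls-keys.log",
		os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o600)
	if err != nil {
		panic(err)
	}
	defer keylog.Close()

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{KeyLogWriter: keylog},
		},
	}

	// Placeholder request; the real test would hit the endpoint that hangs,
	// e.g. the quay.io manifest URL from the debug log above.
	resp, err := client.Get("https://quay.io/v2/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}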
But to my knowledge we haven’t had widespread reports of quay.io breaking, and I think we would have heard.
Is there a corporate proxy involved, by any chance?
Well, it's done from a machine located in a big company, big enough that I don't know the details of our network infrastructure. I am not aware of an actual HTTP proxy (and my environment doesn't define any "http(s)_proxy" variable), but for sure you can expect all the usual "security" stuff you usually find in corporations (if there is a proxy, it's a transparent one which I am not aware of).
So it would mean we have the very same issue with any Go program which doesn't disable HTTP keep-alives? That's strange, I have never run into this before. And it works with other registries because those don't honor HTTP keep-alives?
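One way to check that outside of podman would be a tiny Go program that issues two sequential requests through a default client (which reuses connections) and uses httptrace to report whether the second request reused the first connection. A minimal sketch, with an illustrative quay.io endpoint:
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
)

func main() {
	client := &http.Client{} // default transport: HTTP keep-alives enabled

	for i := 0; i < 2; i++ {
		// Report whether this request reused an idle connection from the pool.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				fmt.Printf("request %d: connection reused=%v\n", i, info.Reused)
			},
		}
		req, err := http.NewRequest("GET", "https://quay.io/v2/", nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		// Drain and close the body so the connection returns to the idle pool.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		fmt.Println("status:", resp.Status)
	}
}
If the second request hangs (or never reports a reused connection) from the corporate network but behaves normally elsewhere, that would point at the network rather than at podman.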
if there is a proxy, it's a transparent one which I am not aware of
For that proxy to directly interfere with Connection: Keep-Alive, it would need to break TLS; so the client would need to have a corporate CA, or something like that, installed, even with a notionally-transparent proxy.
I suppose it’s possible that the proxy/firewall is interfering with just the sequence of packets without understanding the details…
It might be reasonably easy (or, maybe, impossible) to confirm whether something like this is involved by running the same commands with the same software from an outside network.
it would mean we have the very same issue with any Go program which doesn't disable HTTP keep-alives? … And it works with other registries …
I don‘t know where to even start guessing about what exactly that proxy/firewall, if it indeed is a proxy/firewall and not a straightforward Podman bug, doesn’t like. (Purely hypothetically, the first request might trigger a content-specific validator based on Content-Type
or something like that, and that validator might be buggy about continuing to process the following requests on a Connection: Keep-Alive
socket, in some situations.)
I would certainly not rule out a straightforward Podman bug at this point — or it might be a combination of an interfering proxy/firewall and a Podman bug.
For that proxy to directly interfere with Connection: Keep-Alive, it would need to break TLS; so the client would need to have a corporate CA, or something like that, installed, even with a notionally-transparent proxy.
We have such a "deep packet inspection" "security" tool installed on our laptops, and it actually creates lots of issues, as it breaks all TLS flows... But the machine from which I run all these tests is not a corporate laptop and doesn't suffer from any TLS man-in-the-middle issue (no corporate CA is needed for web traffic).
It might be reasonably easy (or, maybe, impossible) to confirm whether something like this is involved by running the same commands with the same software from an outside network.
Mmh, for this I would need to ask some colleagues to test, as the only Linux machines I have easy access to are inside the corporate network.
So I guess I have to investigate further on my side and add traces everywhere to understand what's going on.
I think it’s very fair to say that the pull infrastructure should expose options to get this kind of trace without having to recompile & patch :) Part of that is productizing https://github.com/containers/image/pull/201 , but I don’t think we’ve ever had KeyLogWriter exposed.
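Purely as an illustration of what exposing that could look like (this is not an existing containers/image or podman option, just a sketch), the transport constructor could opt into a key log file via an environment variable:
package tlsclientconfigsketch

import (
	"crypto/tls"
	"net/http"
	"os"
	"time"
)

// NewTransport is a hypothetical sketch: if SSLKEYLOGFILE is set, TLS session
// secrets are appended there so tools like Wireshark can decrypt captures.
func NewTransport() *http.Transport {
	tr := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		TLSHandshakeTimeout: 10 * time.Second,
		IdleConnTimeout:     90 * time.Second,
		MaxIdleConns:        100,
	}
	if path := os.Getenv("SSLKEYLOGFILE"); path != "" {
		if f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o600); err == nil {
			tr.TLSClientConfig = &tls.Config{KeyLogWriter: f}
		}
	}
	return tr
}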
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
Hi,
I still have the issue, but indeed it seems related to some special configuration inside Amadeus (I don't know whether the problem is on our side or on the podman/Go side, though). I still haven't had time to investigate this issue more in depth. I tried last Friday with the latest podman upstream image and the problem is still present for us. As I guess you will ship podman 4.6 in the next RHEL point update, I will have to find a solution anyway, as it will start to block more builds on our side in the future.
Cheers, Romain
@Romain-Geissler-1A are you still seeing the issue?
Hi,
I tried again just now, and I can't reproduce anymore. Note that in the meantime, the machine I use to test this was upgraded to RHEL 9.3/podman 4.6. However, I think this was more related to some internal Amadeus network issue, which may have been resolved since then (I have no idea about any recent internal network changes).
So closing this ticket.
Thanks, Romain
Thanks, @Romain-Geissler-1A !
Issue Description
Hi,
My builds using "podman in docker/podman" scenarios started to fail recently, being blocked on podman pull commands which hang forever. It seems very specific to podman 4.5, as I had no issue with podman 4.4 before. Also, it really feels like there is some kind of race condition in lower parts of the networking stack, as it only rarely works, and it kind of depends on the registry I am trying to reach. In my case, trying to pull from quay.io almost always fails with podman 4.5 while it always worked with 4.4. However, pulling from registry.fedoraproject.org or our private Amadeus Artifactory registry always works, no matter the podman version.
I could reproduce this issue with different hosts and container technologies; all these machines are located inside the Amadeus premises on the same network: RHEL 7 x86_64 kernels running docker 20.10, RHEL 8 aarch64 kernels running docker 23 and podman 4.2 (for the first container technology), RHEL 9 x86_64 kernels running podman 4.2 (for the first container technology).
Cheers, Romain
Steps to reproduce the issue
With quay.io/podman/stable:v4.4, the pull always works fine.
With quay.io/podman/stable:v4.5, most of the time it times out; very rarely, it works.
Describe the results you received
Timeouts most of the time (sometimes I waited for tens and tens of minutes and it miraculously "unblocked" and went ahead).
Describe the results you expected
Pull should always work, from all registries (if registries are up and running fine).
podman info output
Podman in a container
Yes
Privileged Or Rootless
None
Upstream Latest Release
Yes
Additional environment details
No response
Additional information
No response