Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

w/ GitLab Registry: "invalid status code from registry 404" #17999

Open vrothberg opened 1 year ago

vrothberg commented 1 year ago

Discussed in https://github.com/containers/podman/discussions/16842

Originally posted by **2xB** October 6, 2022

**Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)**

/kind bug

**Description**

When pushing to a GitLab Container Registry, I sometimes randomly get `Error: trying to reuse blob sha256:... at destination: Requesting bearer token: invalid status code from registry 404 (Not Found)`. Weirdly enough, when I upload multiple images from the same base image, I get this error message once for every image I try to push; the second push for each image works again.

**Steps to reproduce the issue:**

1. Have a GitLab project with a GitLab Runner that can build and push Podman images (e.g. via Shell executor or custom executor)
2. Use it very often
3. Sometimes (rarely) get this error during `podman push ...`

**Describe the results you received:**

`Error: trying to reuse blob sha256:... at destination: Requesting bearer token: invalid status code from registry 404 (Not Found)`

**Describe the results you expected:**

Successfully pushing the image to the GitLab Container Registry

**Additional information you deem important (e.g. issue happens only occasionally):**

I have a full log of `podman --log-level=debug push ...` from the time it fails. I probably can't post the full log, but if there's something to check in that log, please tell!

**Output of `podman version`:**

```
Client:       Podman Engine
Version:      4.2.1
API Version:  4.2.1
Go Version:   go1.18.5
Built:        Wed Sep 7 19:58:19 2022
OS/Arch:      linux/amd64
```

**Output of `podman info`:**

```
host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.4-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.4, commit: '
  cpuUtilization:
    idlePercent: 94.36
    systemPercent: 0.85
    userPercent: 4.78
  cpus: 6
  distribution:
    distribution: fedora
    variant: container
    version: "36"
  eventLogger: file
  hostname: 04e8b959c867
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.4.0-126-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 2666459136
  memTotal: 4914487296
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.6-2.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.6
      commit: 18cf2efbb8feb2b2f20e316520e0fd0b6c41ef4d
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 4243320832
  swapTotal: 4294963200
  uptime: 26h 8m 44.00s (Approximately 1.08 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.imagestore: /var/lib/shared
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.9-1.fc36.x86_64
      Version: |-
        fusermount3 version: 3.10.5
        fuse-overlayfs: version 1.9
        FUSE library version 3.10.5
        using FUSE kernel interface version 7.31
    overlay.mountopt: nodev,fsync=0
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 40501673984
  graphRootUsed: 16835969024
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /tmp/custom-executor1517200462
  imageStore:
    number: 57
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.1
  Built: 1662580699
  BuiltTime: Wed Sep 7 19:58:19 2022
  GitCommit: ""
  GoVersion: go1.18.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.1
```

**Package info (e.g. output of `rpm -q podman` or `apt list podman` or `brew info podman`):**

```
podman/stable from quay.io, run inside another Podman inside a VirtualBox
```

**Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)**

Yes

**Additional environment details (AWS, VirtualBox, physical, etc.):**

```
podman/stable from quay.io, run inside another Podman inside a VirtualBox
```

vrothberg commented 1 year ago

Relevant quote from @mtrmac in the discussion:

404 when trying to authenticate? Ugh. The ActiveRecord log above is an explanation for how that 404 happens, but it’s still a server-side bug if it reports “this does not exist” and not “server unavailable”, or whatever the cause is…

Right now this error case does not return a Go-detectable error. That can be fixed, of course.

Automatically retrying on a 404… really seems like a bad papering over of a clear problem; if we were building a list of things to never retry on, a 404 would definitely be on it.

I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything (except where the case is known to be local and unfixable, like out of disk space) in c/common/pkg/retry might make sense — as long as we only retry once?

mtrmac commented 1 year ago

I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything

I think it would also be useful to be quite loud about it (warning-level logs at least).
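
As a purely illustrative aside, here is a minimal sketch of the "retry once on almost everything, and be loud about it" idea discussed in the two comments above. This is not the c/common/pkg/retry API; `retryOnce` and `errOutOfSpace` are hypothetical names invented for this example.

```go
package main

import (
    "errors"
    "fmt"
    "log"
)

// errOutOfSpace stands in for a failure class that is known to be local and
// unfixable, where retrying cannot help.
var errOutOfSpace = errors.New("no space left on device")

// retryOnce runs op; if it fails with anything other than a known
// non-retryable error, it logs a warning and tries exactly one more time.
func retryOnce(op func() error) error {
    err := op()
    if err == nil || errors.Is(err, errOutOfSpace) {
        return err
    }
    log.Printf("WARNING: operation failed (%v), retrying once", err)
    return op()
}

func main() {
    attempts := 0
    err := retryOnce(func() error {
        attempts++
        if attempts == 1 {
            return fmt.Errorf("requesting bearer token: invalid status code from registry 404 (Not Found)")
        }
        return nil
    })
    fmt.Printf("attempts: %d, err: %v\n", attempts, err)
}
```

The sketch retries at most once and logs at warning level, matching the constraints suggested above, so that a retried failure always leaves a trace.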

andrewleech commented 1 year ago

I'm planning on raising a matching issue on GitLab's issue tracker when I'm back in the office next week to get their input.

It's worth noting, though, that site-wide on GitLab, any time there's an auth / permission restriction (e.g. trying to view a private project you don't have access to) the server responds with a 404. I think this is a valid response choice from a security perspective, in that the server can't "leak" the existence of a resource you don't have permission to view.

That might be why we get a 404 here, if it's some kind of rate limiting on the auth endpoint?

andrewleech commented 1 year ago

I've raised https://gitlab.com/gitlab-org/gitlab/-/issues/404326 to request some feedback from the gitlab side.

mtrmac commented 1 year ago

It's worth noting, though, that site-wide on GitLab, any time there's an auth / permission restriction (e.g. trying to view a private project you don't have access to) the server responds with a 404.

That concept of pretending something doesn’t exist is fine; but the server still needs to follow other constraints of the protocol.

E.g. that other ticket shows that the server returns a 404 on GET /jwt/auth?…. I read that as claiming not that the parameters refer to an inconsistent repo, but that /jwt/auth itself doesn’t exist, and that doesn’t really make sense.

What is the client to do with that information? We could, reasonably, and consistently with that concept, set up this operation to treat 404 (“does not exist”) the same as 403 (unauthorized) — but 403 is explicitly one of the results where we don’t retry because it wouldn’t help and it could hurt (by locking the user out due to repeated authentication failures).

If the server is encountering some kind of outage / downtime / overload, I think it should indicate an outage / downtime / overload; that reveals nothing about the authentication rules to the client, but it allows the client to make a desirable retry decision.

2xB commented 1 year ago

I don't think I have seen this error again after reporting it, although it did occur multiple times before. The only thing I can think of is that since then I've regularly updated both GitLab and Podman to the latest versions. It might make sense to track which Podman and GitLab versions people see this issue with.

@mtrmac Regarding "I think it would also be useful to be quite loud about it (warning-level logs at least)": for the original bug report I have such a log. As mentioned in the initial bug report, I probably can't post the full log, but if there's something to check in that log, please tell!

mtrmac commented 1 year ago

Historically, there was https://gitlab.com/gitlab-org/gitlab/-/issues/215715 , where a repo is expected to be auto-created on a first upload, but if it is accessed while it is being auto-created (in this case, with parallel uploads of multiple layers), some of the attempts may fail. I’m not at all sure this is related in any way, but it is a bit suggestive in that maybe failures encountered while setting up a process might not show up afterwards.


The “quite loud warning-level logs” idea was suggested as something to include in the future auto-retry behavior of Podman, so that there is a trace of a thing failing. It wouldn’t directly help with this situation, especially if the failures no longer occur.

In this case, I think the ActiveRecord::RecordNotFound logs in the GitLab ticket are the most important data to allow a code fix.

2xB commented 1 year ago

@mtrmac This is an amazing find. I never considered it, but yes: when I reported this issue, that was the first push to that image repository, and after deleting an image repository and pushing to it again, I see this issue again. A reason I never considered it is that it does not happen at the beginning of the push transaction, but rather at the end. In any case, I can confirm that for me, too, this happens when repositories are auto-created.

bmaupin commented 1 year ago

This happens to me almost every time when pushing to a new image repository. The first push fails, but GitLab creates the repository successfully, and the second push works fine.

I can't reproduce it with docker at all, only buildah. Maybe docker has some kind of built-in retry logic?

bmaupin commented 1 year ago

My workaround is to just try the push twice. So in my `.gitlab-ci.yml`, I replaced

`buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"`

with

`buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG" || sleep 10; buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"`

mhio commented 1 year ago

Maybe docker has some kind of built-in retry logic?

I haven't seen Docker attempt 6 simultaneous auth requests for the JWT bearer token prior to the image push. If the root cause is a race condition in GitLab's ActiveRecord, it's hard to trigger that when there is no race.

A `buildah login` first might work, instead of the push attempting to populate the bearer token.

bmaupin commented 1 year ago

A `buildah login` first might work, instead of the push attempting to populate the bearer token.

I'm pushing to an authenticated registry so I'm already doing a buildah login. That doesn't seem to fix the issue.

I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.

mtrmac commented 1 year ago

Requesting bearer token: invalid status code from registry 404 (Not Found)

Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.

The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.
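
For illustration only, a minimal sketch of that in-flight tracking idea using `golang.org/x/sync/singleflight`. This is not the actual containers/image code; `getToken` and `fetchBearerToken` are hypothetical names for this example.

```go
package main

import (
    "fmt"
    "sync"

    "golang.org/x/sync/singleflight"
)

var tokenRequests singleflight.Group

// getToken deduplicates concurrent token requests for the same registry scope:
// callers asking for the same key while a request is in flight wait for that
// request's result instead of issuing their own.
func getToken(scope string, fetchBearerToken func(string) (string, error)) (string, error) {
    v, err, _ := tokenRequests.Do(scope, func() (interface{}, error) {
        return fetchBearerToken(scope)
    })
    if err != nil {
        return "", err
    }
    return v.(string), nil
}

func main() {
    var mu sync.Mutex
    calls := 0
    fetch := func(scope string) (string, error) {
        mu.Lock()
        calls++
        mu.Unlock()
        return "token-for-" + scope, nil
    }

    // Several concurrent layer pushes asking for the same scope would normally
    // each trigger an auth request; here, callers that overlap share one.
    var wg sync.WaitGroup
    for i := 0; i < 6; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _, _ = getToken("repository:group/project:push", fetch)
        }()
    }
    wg.Wait()
    fmt.Println("auth requests issued:", calls)
}
```

Whether deduplicating the requests would avoid the GitLab-side failure is, as noted above, not certain; it mainly reduces the number of concurrent auth requests hitting the server.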

mhio commented 1 year ago

I'm pushing to an authenticated registry so I'm already doing a buildah login. That doesn't seem to fix the issue.

I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.

Interesting. I had only seen a 404 in the batch of JWT auth requests to GitLab, not in the requests to the registry.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

andrewleech commented 1 year ago

FYI, the ticket I raised with GitLab for this issue has been triaged as severity: major, so hopefully we'll get some investigation from their end soon.

There was also a potential workaround posted that involves building to a different tag name and then retagging to the desired tag, though it doesn't make much sense to me, as I wouldn't expect it to result in any change to the registry API calls: https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1401274540

mtrmac commented 1 year ago

Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.

The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.

I have a draft implementation in https://github.com/containers/image/pull/1968 . Could someone who can reliably reproduce this GitLab failure test a Podman build with that change?

bcg62 commented 1 year ago

@mtrmac I can reproduce this; if you or someone has a build available, I'll give it a try.

andrewleech commented 1 year ago

If anyone has ideas or information about reproducing the issue (even just a description of a CI setup that does reproduce it), GitLab is trying to investigate it here: https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1587264776