Open vrothberg opened 1 year ago
Relevant quote from @mtrmac in the discussion:
> 404 when trying to authenticate? Ugh. The ActiveRecord log above is an explanation for how that 404 happens, but it’s still a server-side bug if it reports “this does not exist” and not “server unavailable”, or whatever the cause is…
>
> Right now this error case does not return a Go-detectable error. That can be fixed, of course.
>
> Automatically retrying on a 404… really seems like a bad papering over of a clear problem; if we were building a list of things to never retry on, a 404 would definitely be on it.
>
> I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything (except where the case is known to be local and unfixable, like out of disk space) in c/common/pkg/retry might make sense — as long as we only retry once?
> I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything
I think it would also be useful to be quite loud about it (warning-level logs at least).
I'm planning on raising a matching issue on GitLab's issue tracker when I'm back in the office next week to get their input.
It's worth noting though that site-wide on GitLab, any time there's an auth / permission restriction (e.g. trying to view a private project you don't have access to), the server responds with 404. I think this is a valid response choice from a security perspective, in that the server can't "leak" the existence of a resource you don't have permission to view.
That might be why we get a 404 here, if it's some kind of rate limiting on the auth endpoint?
I've raised https://gitlab.com/gitlab-org/gitlab/-/issues/404326 to request some feedback from the gitlab side.
> It's worth noting though that site-wide on gitlab, any time there's an auth / permission restriction (eg trying to view a private project you don't have access to) the server responds with 404.
That concept of pretending something doesn’t exist is fine; but the server still needs to follow other constraints of the protocol.
E.g. that other ticket shows that the server returns a 404 on `GET /jwt/auth?…`. I read that as claiming not that the parameters refer to an inconsistent repo, but that `/jwt/auth` itself doesn’t exist, and that doesn’t really make sense.
What is the client to do with that information? We could, reasonably, and consistently with that concept, set up this operation to treat 404 (“does not exist”) the same as 403 (unauthorized) — but 403 is explicitly one of the results where we don’t retry because it wouldn’t help and it could hurt (by locking the user out due to repeated authentication failures).
If the server is encountering some kind of outage / downtime / overload, I think it should indicate an outage / downtime / overload; that reveals nothing about the authentication rules to the client, but it allows the client to make a desirable retry decision.
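To spell out the client-side decision this implies, here is a sketch of classifying HTTP statuses into "retry" and "don't retry" buckets. This is a hypothetical helper, not code from containers/image; it just encodes the reasoning above (5xx/429 signal outage or overload worth retrying; 403 and a protocol-conforming 404 mean a retry cannot help and, for auth failures, could even lock the user out):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry reports whether a failed registry request is worth
// retrying, based only on its HTTP status code.
func shouldRetry(status int) bool {
	switch {
	case status == http.StatusTooManyRequests: // 429: rate-limited, back off and retry
		return true
	case status >= 500: // outage / downtime / overload
		return true
	default: // 403, 404, other 4xx: retrying would not help
		return false
	}
}

func main() {
	for _, s := range []int{403, 404, 429, 503} {
		fmt.Println(s, shouldRetry(s))
	}
}
```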
I don't think I have seen this error again after reporting it, although it did occur multiple times before. The only thing I can imagine is that since then, I've regularly updated both GitLab and Podman to the latest version. Potentially it makes sense to track with which Podman and GitLab versions people see this issue.
@mtrmac Regarding

> I think it would also be useful to be quite loud about it (warning-level logs at least).

For the original bug report, I have such a log. As mentioned in the initial bug report, I probably can't post the full log, but if there's something to check in that log, please tell!
Historically, there was https://gitlab.com/gitlab-org/gitlab/-/issues/215715 , where a repo is expected to be auto-created on a first upload, but if it is accessed while it is being auto-created (in this case, with parallel uploads of multiple layers), some of the attempts may fail. I’m not at all sure this is related in any way, but it is a bit suggestive in that maybe failures encountered while setting up a process might not show up afterwards.
The “quite loud warning-level logs” idea was suggested as something to include in the future auto-retry behavior of Podman, so that there is a trace of a thing failing. It wouldn’t directly help with this situation, especially if the failures no longer occur.
In this case, I think the `ActiveRecord::RecordNotFound` logs in the GitLab ticket are the most important data to allow a code fix.
@mtrmac This is an amazing find. I never considered it, but yes: when I reported this issue, that was the first push to that image repository, and after deleting an image repository and pushing to it again, I see this issue again. A reason I never considered it is that this does not happen at the beginning of the push transaction, but rather at the end. In any case, I can confirm that for me, too, this happens when repositories are auto-created.
This happens to me almost every time when pushing to a new image registry. The first push fails, but GitLab creates the registry successfully, and the second push works fine.
I can't reproduce it with `docker` at all, only `buildah`. Maybe docker has some kind of built-in retry logic?
My workaround is to just try the push twice. So in my `.gitlab-ci.yml`, I replaced

```shell
buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"
```

with

```shell
buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG" || { sleep 10; buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"; }
```

(The braces group the sleep and the second push, so the retry only runs if the first push fails; without them, `;` binds looser than `||` and the second push would run unconditionally.)
> Maybe docker has some kind of built-in retry logic?
I haven't seen Docker attempt 6 simultaneous auth requests for the JWT bearer token prior to the image push. If the root cause is a race condition in GitLab's ActiveRecord, it's hard to trigger it when there is no race.
A `buildah login` first might work, instead of the `push` attempting to populate the bearer token.
> A `buildah login` first might work, instead of the `push` attempting to populate the bearer token.
I'm pushing to an authenticated registry, so I'm already doing a `buildah login`. That doesn't seem to fix the issue.
I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.
```
Requesting bearer token: invalid status code from registry 404 (Not Found)
```
Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.
The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.
> I'm pushing to an authenticated registry so I'm already doing a `buildah login`. That doesn't seem to fix the issue. I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.
Interesting, I had only seen a 404 in the batch of JWT auth requests to gitlab, not in the requests to the registry.
FYI, the ticket I raised at GitLab for this issue has been triaged as `severity: major`, so hopefully we'll get some investigation from their end soon.
There was also a potential workaround posted that involves building to a different tag name and then retagging to the desired tag, though it doesn't make much sense to me, as I wouldn't expect it to change the registry API calls in any way: https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1401274540
> Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.
>
> The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.
I have a draft implementation in https://github.com/containers/image/pull/1968 . Could someone who can reliably reproduce this GitLab failure test a Podman build with that change?
@mtrmac I can reproduce this; if you or someone has a build available, I'll give it a try.
If anyone has ideas or information about reproducing the issue (even just a description of a CI setup that reproduces it), GitLab is trying to investigate it here: https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1587264776
Discussed in https://github.com/containers/podman/discussions/16842