Recently, on change https://review.opendev.org/c/opendev/system-config/+/806448, our CD was "promoting" a number of new containers (i.e. tagging previously uploaded and verified containers as the latest and cleaning up old tags). We saw HTTP 500 errors from the Docker Hub repository tags endpoint several times during this process, on three different jobs:
2021-08-29 01:49:23.495635 "message": "failed to update 1 tag(s) and first error msg: updating images in DB failed tag=change_806448_3.9 namespace=opendevorg repository=python-base: getRepoTagID in UpdateRepositoryTagImages query error namespace=opendevorg repository=python-base tag=change_806448_3.9: dbr: not found"
...
2021-08-29 01:49:23.496220 "url": "https://hub.docker.com/v2/repositories/opendevorg/python-base/tags?page_size=1000",
Note that all of these were running in parallel on different hosts. Timestamps are in UTC.
The full collection of attempts is at https://zuul.opendev.org/t/openstack/buildset/e9aa11951b2a4da49b45e7da3df7a811. We have converted our python-base-3.X image into separate python-base-3.X-<buster|bullseye> images, and you can see each job "promoting" its image. We ran this same process for all of these images and only these three returned the 500 error, so it looks like a race condition.
I believe we may have seen this before, although I haven't dug through the logs to find examples. We can back off and retry on 500 errors, but I'm sure there is a better solution. Pretty much all the client-side logs are available in the links above; please let us know if there is anything else we can do to help diagnose this issue.
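The back-off-and-retry workaround mentioned above could look roughly like the minimal Python sketch below. It is not taken from the zuul-jobs role: it assumes the failing call is a plain GET of the tags URL shown in the log, omits Docker Hub authentication entirely, and the attempt count and delays are illustrative values only.

```python
import json
import random
import time
import urllib.error
import urllib.request

# URL shape copied from the failing request in the log; authentication is omitted.
TAGS_URL = "https://hub.docker.com/v2/repositories/{namespace}/{repo}/tags?page_size=1000"


def get_tags(namespace, repo):
    """Plain GET of the tags endpoint, decoding the JSON body."""
    url = TAGS_URL.format(namespace=namespace, repo=repo)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def with_retries(call, attempts=5, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Return call(); on HTTP 5xx, sleep with capped exponential backoff and retry.

    Client errors (4xx) and the last failed attempt are re-raised unchanged.
    Only HTTPError is handled; network errors such as URLError propagate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except urllib.error.HTTPError as err:
            if err.code < 500 or attempt == attempts - 1:
                raise
            # Full jitter: wait a random time up to the capped exponential delay,
            # so parallel jobs that failed together do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Usage would be something like `tags = with_retries(lambda: get_tags("opendevorg", "python-base"))`. The injectable `sleep` parameter only exists to make the sketch easy to test; it hides the symptom rather than fixing whatever is failing server-side.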
We are clearing up our old issues and your ticket has been open for 6 months with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi,
Searching for the specific error message quoted above turned up nothing.
Our CD pipeline uploads containers and deletes old tags; the actual code is https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/promote-docker-image/tasks/promote-cleanup.yaml#L1, and, to be specific, the call we have seen failing is the tags request quoted above.
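The "delete old tags" step can be pictured roughly as the minimal Python sketch below. It is not a transcription of promote-cleanup.yaml: it assumes the tags response carries a results list whose entries have name and last_updated fields, and both the change_ prefix (borrowed from the tag name in the log) and the 24-hour max_age are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse ISO 8601 timestamps such as '2021-08-29T01:49:23.495635Z'.

    On Python < 3.11, fromisoformat only accepts 0, 3 or 6 fractional digits.
    """
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def tags_to_delete(results, now, max_age=timedelta(hours=24), prefix="change_"):
    """Pick hypothetical cleanup candidates: prefixed tags older than max_age.

    results is the list from a single page of the tags response; pagination
    via a "next" link is not handled here.
    """
    cutoff = now - max_age
    return [
        tag["name"]
        for tag in results
        if tag["name"].startswith(prefix)
        and tag.get("last_updated")  # skip entries without a timestamp
        and parse_ts(tag["last_updated"]) < cutoff
    ]
```

With several jobs listing and deleting tags in the same opendevorg/python-base repository at once, one job's listing could in principle include a tag another job is about to change; that would fit the race-condition suspicion, but it is speculation.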
The three failing jobs were:
https://zuul.opendev.org/t/openstack/build/6909d950f40f4ab793f6b46d9c314985 @ task failure
https://zuul.opendev.org/t/openstack/build/cc453830c96a4dd98f3b3ecb4db9e026 @ task failure
https://zuul.opendev.org/t/openstack/build/a18b34c5019c4831a9ac363cf710ee87 @ task failure