
Fails to pull image from Quay Enterprise with auth and proxy storage enabled #1948

Open TomSweeneyRedHat opened 1 year ago

TomSweeneyRedHat commented 1 year ago

Copying the Description from: https://bugzilla.redhat.com/show_bug.cgi?id=1774279

There's some question about whether or not this is a bug, but it's something for us to keep an eye on going forward. The following is from the cloned bug (https://bugzilla.redhat.com/show_bug.cgi?id=1695236) which is world readable. Most of the comments in 1774279 were "Red Hat Internal", so I've not copied them here to this public board.

Description of problem: Podman does not request the correct image size when a container image is pulled from Quay Enterprise configured with a Ceph storage backend and storage proxying enabled. Because of this, the pull fails every time. If proxy storage is disabled, podman pulls the image correctly.

Version-Release number of selected component (if applicable): Quay Enterprise 2.9.5, Podman 1.1.2

How reproducible: always

Steps to Reproduce:

  1. set up Quay Enterprise 2.9.5 with FEATURE_PROXY_STORAGE: true in config.yaml
  2. set up Ceph storage backend in Quay
  3. try and pull any image from Quay Enterprise

Actual results: Podman fails. This is the debug log from the pull:

podman pull --creds "username:password" docker://registry.local:80/username/alpine:latest --log-level debug

DEBU[0000] Initializing boltdb state at /var/lib/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /var/lib/containers/storage
DEBU[0000] Using run root /var/run/containers/storage
DEBU[0000] Using static dir /var/lib/containers/storage/libpod
DEBU[0000] Using tmp dir /var/run/libpod
DEBU[0000] Using volume path /var/lib/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] overlay test mount with multiple lowers succeeded
DEBU[0000] overlay test mount indicated that metacopy is not being used
DEBU[0000] backingFs=extfs, projectQuotaSupported=false, useNativeDiff=true, usingMetacopy=false
INFO[0000] Found CNI network podman (type=bridge) at /etc/cni/net.d/87-podman-bridge.conflist
DEBU[0000] parsed reference into "[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.mountopt=nodev]registry.local:80/username/alpine:latest"
Trying to pull docker://registry.local:80/username/alpine:latest...
DEBU[0000] Using registries.d directory /etc/containers/registries.d for sigstore configuration
DEBU[0000] Using "default-docker" configuration
DEBU[0000] No signature storage configuration found for registry.local:80/username/alpine:latest
DEBU[0000] Looking for TLS certificates and private keys in /etc/docker/certs.d/registry.local:80
DEBU[0000] Using blob info cache at /var/lib/containers/cache/blob-info-cache-v1.boltdb
DEBU[0000] GET https://registry.local:80/v2/
DEBU[0000] Ping https://registry.local:80/v2/ err Get https://registry.local:80/v2/: http: server gave HTTP response to HTTPS client (&url.Error{Op:"Get", URL:"https://registry.local:80/v2/", Err:(*errors.errorString)(0xc000506020)})
DEBU[0000] GET http://registry.local:80/v2/
DEBU[0000] Ping http://registry.local:80/v2/ status 401
DEBU[0000] GET http://registry.local/v2/auth?account=username&scope=repository%3Ausername%2Falpine%3Apull&service=registry.local
DEBU[0000] Increasing token expiration to: 60 seconds
DEBU[0000] GET http://registry.local:80/v2/username/alpine/manifests/latest
DEBU[0002] IsRunningImageAllowed for image docker:registry.local:80/username/alpine:latest
DEBU[0002] Using default policy section
DEBU[0002] Requirement 0: allowed
DEBU[0002] Overall: allowed
Getting image source signatures
DEBU[0002] Manifest has MIME type application/vnd.docker.distribution.manifest.v1+prettyjws, ordered candidate list [application/vnd.docker.distribution.manifest.v1+prettyjws, application/vnd.docker.distribution.manifest.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v1+json]
DEBU[0002] ... will first try using the original manifest unmodified
DEBU[0002] Downloading /v2/username/alpine/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
DEBU[0002] GET http://registry.local:80/v2/username/alpine/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
DEBU[0002] Downloading /v2/username/alpine/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c
DEBU[0002] GET http://registry.local:80/v2/username/alpine/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c
Copying blob 8e402f1a9c57 [======================================] 1.4GiB / 1.4GiB
Copying blob a3ed95caeb02 [======================================] 1.4GiB / 1.4GiB
DEBU[0003] Error pulling image ref //registry.local:80/username/alpine:latest: Error reading blob sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c: Invalid status code returned when fetching blob 400 (Bad Request)
Failed (0x1840f00,0xc000246c60)

Expected results: Podman successfully pulls the image from storage.

Additional info: The reason Ceph returns the 400 is that the layer size is not correct. For some reason, podman assumes a size of 1.4 GB, which is incorrect, since the whole alpine image is roughly 5 MB. Additionally, when the direct URL to the image layer is reverse-engineered from Quay's log, curling the URL works and the image can be pulled. The pull also works with the Docker CLI, regardless of whether storage is proxied or not.

Ivan Bazulic 2019-04-02 17:07:29 UTC Hardware: Unspecified → x86_64 OS: Unspecified → Linux Severity: unspecified → low
Tom McKay 2019-04-03 12:28:09 UTC CC: tomckay@redhat.com
Daniel Walsh 2019-04-03 14:06:10 UTC Assignee: lsm5@redhat.com → vrothber@redhat.com
Daniel Walsh 2019-04-03 14:06:38 UTC Status: NEW → ASSIGNED CC: mitr@redhat.com

Private Comment 1 Miloslav Trmač 2019-04-03 15:06:47 UTC

What do you mean by “requests storage size”?

This sounds unlikely; the GET …/blobs/… is exactly what the code asks for, there is no Range header or anything else specifying any particular size.

Copying blob a3ed95caeb02 [======================================] 1.4GiB / 1.4GiB … For some reason, podman assumes a size of 1.4 GB which is incorrect, since the whole alpine image has roughly 5 MB.

This is only a bug in the progress bar; schema1 manifests don’t include a blob size at all (which is another reason why podman can’t be requesting an incorrect one), which causes the progress bar to be created with size -1, and for some reason the progress bar package uses the current timestamp for the size — and that works out to about 1.4 GiB nowadays.
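(For scale: a Unix timestamp from April 2019 is roughly 1,554,000,000; read as a byte count, 1,554,000,000 / 2^30 ≈ 1.45 GiB, which matches the "1.4GiB / 1.4GiB" totals shown by the progress bars in the log above.)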

The reason Ceph returns the 400 is because layer size is not correct.

Have you actually verified this on the Ceph side? If so, where is Ceph reading the size from?

Flags: needinfo?(ibazulic@redhat.com) Private Comment 2 Ivan Bazulic 2019-04-08 06:41:24 UTC

Created attachment 1553440 [details] Docker pull logs from Quay

Flags: needinfo?(ibazulic@redhat.com) Private Comment 3 Ivan Bazulic 2019-04-08 06:41:56 UTC

Created attachment 1553441 [details] Podman pull logs from Quay

Private Comment 4 Ivan Bazulic 2019-04-08 06:49:23 UTC

(In reply to Miloslav Trmač from comment #1)

What do you mean by “requests storage size”?

Where did I mention this?

This sounds unlikely; the GET …/blobs/… is exactly what the code asks for, there is no Range header or anything else specifying any particular size.

Fair enough.

The reason Ceph returns the 400 is because layer size is not correct.

Have you actually verified this on the Ceph side? If so, where is Ceph reading the size from?

Yes. This is a log from the Ceph side when 'docker pull' is invoked:

2019-04-08 02:33:43.444385 7ff405da7700 1 ====== starting new request req=0x7ff405da1650 =====
2019-04-08 02:33:43.476985 7ff405da7700 1 ====== req done req=0x7ff405da1650 op status=0 http_status=200 ======
2019-04-08 02:33:43.477052 7ff405da7700 1 civetweb: 0x557a4feb0000: 192.168.58.3 - - [08/Apr/2019:02:33:43 -0400] "GET /my-new-bucket/datastorage/registry/sha256/8e/8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c HTTP/1.0" 200 0 http://ubuntu.tardis.local:80/v2/ibazulic/alpine.latest/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c docker/18.09.4 go/go1.10.8 git-commit/d14af54 kernel/5.0.5-200.fc29.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.4 (linux))

As you can see, the return is 200. Now, take a look at the podman request:

2019-04-08 02:37:06.934968 7ff402da1700 1 ====== starting new request req=0x7ff402d9b650 =====
2019-04-08 02:37:06.935218 7ff402da1700 1 ====== req done req=0x7ff402d9b650 op status=0 http_status=400 ======
2019-04-08 02:37:06.935256 7ff402da1700 1 civetweb: 0x557a4feba000: 192.168.58.3 - - [08/Apr/2019:02:37:06 -0400] "GET /my-new-bucket/datastorage/registry/sha256/a3/a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 HTTP/1.0" 400 0 http://ubuntu.tardis.local:80/v2/ibazulic/alpine.latest/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 Go-http-client/1.1

It's 400. I have also provided Quay log files from both podman and docker pulls, with Quay debugging on. You'll notice that on both pulls the JWT tokens created by Quay are the same when decoded with jwt.io. I have talked extensively with Joey Schorr about this, since at first I thought this was a bug in Quay (and opened a JIRA for it). However, after carefully examining the requests received and sent by Quay to Ceph, we believe that the issue is in podman. What's surprising is that the same issue does not happen when podman is pushing to Quay; only pulls are problematic.

If there's anything else I can provide, let me know.

Private Comment 5 Ivan Bazulic 2019-04-08 10:39:06 UTC

What's even more interesting is the fact that Podman requests a totally wrong blob when proxy storage is enabled. I removed the proxy storage from the config file and now Podman pulls the correct blob:

Apr 8 10:35:37 23b6e6933843 gunicorn_registry[1844]: 2019-04-08 10:35:37,332 [6258] [INFO] [gunicorn.access] 192.168.56.1 - - [08/Apr/2019:10:35:37] "GET /v2/ibazulic/alpine.latest/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c HTTP/1.1" 302 671 "-" "Go-http-client/1.1" Apr 8 10:35:37 23b6e6933843 nginx[1893]: 192.168.56.1 () - - [08/Apr/2019:10:35:37 +0000] "GET /v2/ibazulic/alpine.latest/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c HTTP/1.1" 302 671 "-" "Go-http-client/1.1" (0.063 1338 0.060)

Also, Ceph reports a 200 on every request as well:

2019-04-08 05:50:58.804772 7ff40ddb7700 1 ====== starting new request req=0x7ff40ddb1650 ===== 2019-04-08 05:50:58.879994 7ff40ddb7700 1 ====== req done req=0x7ff40ddb1650 op status=0 http_status=200 ====== 2019-04-08 05:50:58.880081 7ff40ddb7700 1 civetweb: 0x557a4fed3000: 192.168.58.1 - - [08/Apr/2019:05:50:58 -0400] "GET /my-new-bucket/datastorage/registry/sha256/8e/8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c HTTP/1.1" 200 0 http://ubuntu.tardis.local:80/v2/ibazulic/alpine.latest/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c Go-http-client/1.1

Private Comment 6 Miloslav Trmač 2019-04-08 13:39:10 UTC

(In reply to Miloslav Trmač from comment #1)

What do you mean by “requests storage size”?

Where did I mention this?

I’m sorry, the actual quote was “does not request the correct image size”. (That is even less applicable because images are not downloaded as whole units in the docker/distribution HTTP API.)

Have you actually verified this on the Ceph side? If so, where is Ceph reading the size from?

Yes. This is a log from the Ceph side when 'docker pull' is invoked:

2019-04-08 02:33:43.444385 7ff405da7700 1 ====== starting new request req=0x7ff405da1650 =====
2019-04-08 02:33:43.476985 7ff405da7700 1 ====== req done req=0x7ff405da1650 op status=0 http_status=200 ======
2019-04-08 02:33:43.477052 7ff405da7700 1 civetweb: 0x557a4feb0000: 192.168.58.3 - - [08/Apr/2019:02:33:43 -0400] "GET /my-new-bucket/datastorage/registry/sha256/8e/8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c HTTP/1.0" 200 0 http://ubuntu.tardis.local:80/v2/ibazulic/alpine.latest/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c docker/18.09.4 go/go1.10.8 git-commit/d14af54 kernel/5.0.5-200.fc29.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.4 (linux))

Nothing in there involves any kind of size determination / check / recorded value AFAICT; all it says is a 400, from some /my-new-bucket/ URL that is internal to the Quay implementation and not directly specified by a docker/distribution client.


(In reply to Ivan Bazulic from comment #5)

What's even more interesting is the fact that Podman requests a totally wrong blob when proxy storage is enabled. I removed the proxy storage from the config file and now Podman pulls the correct blob:

Ah, I think I know a bit about what is going on.

Guessing that the image corresponds with current docker.io/library/alpine:latest, i.e. docker://alpine@sha256:644fcb1a676b5165371437feaa922943aaf7afcfa8bfee4472f6860aad1ef2a0 in skopeo syntax. That image, in the schema2 format, has a single layer blob, 8e4… , and a config blob, 5cb….

The image pulled from your registry is not exactly that original one; it is a schema1 image:

Manifest has MIME type application/vnd.docker.distribution.manifest.v1+prettyjws

The schema2 image has 2 entries in its config.json history field; in schema1, that is represented as two blobs (entries in “fsLayers” and “history”), one of them an empty gzipped tar file. Conventionally, a specific representation with digest sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 is used for that empty one. That blob is an actual artifact of the schema1 image, and the registry must, AFAIK, be able to serve it just like any other blob (correspondingly, schema1 pushes do actually push that blob to registries).
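As a side note on what that conventional blob actually is: an empty tar archive is just the 1024-byte end-of-archive marker, and gzip-compressing it yields a blob of only a few dozen bytes. A minimal Go sketch that builds such a blob and prints its digest is below; note that the exact digest depends on gzip header fields and compression settings, so it may not reproduce the canonical a3ed95… bytes byte-for-byte — registries and Docker conventionally use one fixed representation.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

func main() {
	// An empty tar archive is just the end-of-archive marker (1024 zero bytes).
	var tarBuf bytes.Buffer
	tw := tar.NewWriter(&tarBuf)
	if err := tw.Close(); err != nil {
		panic(err)
	}

	// Gzip it, as schema1 "throwaway" layers are stored gzip-compressed.
	var gzBuf bytes.Buffer
	gw := gzip.NewWriter(&gzBuf)
	if _, err := gw.Write(tarBuf.Bytes()); err != nil {
		panic(err)
	}
	if err := gw.Close(); err != nil {
		panic(err)
	}

	fmt.Printf("empty tar: %d bytes, gzipped: %d bytes\n", tarBuf.Len(), gzBuf.Len())
	fmt.Printf("sha256:%x\n", sha256.Sum256(gzBuf.Bytes()))
}
```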


Another aspect of that is that docker has an optimization not to pull these empty blobs, and podman does not - it pulls them and discards them later, which causes them to be pulled again every time. So, yes, there is a behavior difference between (recent) docker versions and podman in pulling a3ed95…, which causes this to manifest with podman. Still, the a3ed95… layer is an integral part of the image, explicitly listed in the manifest, so it should actually be served by the registry AFAICT (notably nothing makes throwaway layers special in https://github.com/docker/distribution/blob/master/docs/spec/manifest-v2-1.md ).


So, something is processing the a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 empty tar file incorrectly; I’d guess that the storage layer is treating it specially (maybe not recording it in the backing storage at all, and hard-coding the contents in the serving application?), and that special treatment is somehow missing/broken/different for the storage proxy. (Podman is completely unaware of pull-through/storage proxies; that happens server-side.)

Private Comment 7 Daniel Walsh 2019-04-08 14:04:19 UTC

Mitr, should we add this optimization to Podman?

Another aspect of that is that docker has an optimization not to pull these empty blobs, and podman does not - it pulls them and discards them later, which causes them to be pulled again every time. So, yes, there is a behavior difference between (recent) docker versions and podman in pulling a3ed95…, which causes this to manifest with podman. Still, the a3ed95… layer is an integral part of the image, explicitly listed in the manifest, so it should actually be served by the registry AFAICT (notably nothing makes throwaway layers special in https://github.com/docker/distribution/blob/master/docs/spec/manifest-v2-1.md ).

Private Comment 8 Miloslav Trmač 2019-04-08 14:10:20 UTC

(In reply to Daniel Walsh from comment #7)

Mitr, should we add this optimization to Podman?

Another aspect of that is that docker has an optimization not to pull these empty blobs, and podman does not…

Sure, in principle, but with pretty low priority: this affects only schema1, which has been going away for a long time (notably no multi-arch support), and saves about 1 HTTP round-trip per schema1 image. The storageImageDestination.TryReusingBlob code path is already getting pretty complex (with the recent concurrent copy detection and the like), this empty layer detection should only be added if it does not interfere with those more important features (and, while unlikely, if it does not affect readability significantly).
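For illustration of the kind of optimization being discussed — a hypothetical helper, not the actual containers/image TryReusingBlob code — the idea is simply to recognize the conventional empty-gzip digest before fetching and synthesize that layer locally instead of requesting it:

```go
package main

import "fmt"

// Conventional digest of the gzipped empty tar used by schema1 images
// (the a3ed95… blob seen in the logs above).
const gzippedEmptyLayerDigest = "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"

// layersToFetch is a hypothetical helper: it drops empty-layer digests from
// the download list so they are never requested from the registry.
func layersToFetch(digests []string) []string {
	var out []string
	for _, d := range digests {
		if d == gzippedEmptyLayerDigest {
			continue // reuse a locally generated empty layer instead
		}
		out = append(out, d)
	}
	return out
}

func main() {
	fmt.Println(layersToFetch([]string{
		"sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c",
		gzippedEmptyLayerDigest,
	}))
}
```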

And if Quay is handling such layers incorrectly, that should be fixed either way.

Private Comment 9 jschorr@redhat.com 2019-04-08 17:35:00 UTC

So, something is processing the a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 empty tar file incorrectly; I’d guess that the storage layer is treating it specially (maybe not recording it in the backing storage at all, and hard-coding the contents in the serving application?), and that special treatment is somehow missing/broken/different for the storage proxy. (Podman is completely unaware of pull-through/storage proxies; that happens server-side.)

Actually, no. Quay explicitly ensures that the special layer is stored into storage when a schema1 manifest is referenced in order to ensure that it can be pulled later on. Even if the layer was missing, that should cause a 404, not a 400, and it would then definitely fail when storage proxying is turned off and podman is used, since it wouldn't be present.

I think something more is going on here, but I'm unsure what it can be; storage proxying only determines from where the client pulls the blobs; it still serves the same exact blobs from storage.

CC: jschorr@redhat.com Private Comment 10 Miloslav Trmač 2019-04-08 17:57:59 UTC

I mean, somewhere within Quay or its backend services there must be an “if $condition { fail with 400 }”; it should always be possible to trace this down to the actual cause.

Failing that, extending podman with https://github.com/mtrmac/image/tree/http-dump, or an equivalent server-side full HTTP capture might help, because Podman’s docker/distribution API client, on this particular code path, only reports the HTTP status code and discards the response body — but that’s only going to help if the server actually reports the underlying cause in there. Is there a server I can easily access to test that? Or a ready-to-go VM image, or something like that? I’ve never deployed Quay nor Ceph, so setting this up myself would be very time-consuming.
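For anyone trying to reproduce this, a rough equivalent of that http-dump approach (a sketch, not the linked branch itself) is to wrap the HTTP client's transport so it also prints the response body that the registry client otherwise discards:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
)

// dumpTransport prints every request and response on the wire, including the
// response body, which may contain the server's explanation for the 400.
type dumpTransport struct{ inner http.RoundTripper }

func (t dumpTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if b, err := httputil.DumpRequestOut(req, false); err == nil {
		fmt.Printf(">>> request\n%s\n", b)
	}
	resp, err := t.inner.RoundTrip(req)
	if resp != nil {
		// DumpResponse with body=true leaves the body readable for callers.
		if b, dumpErr := httputil.DumpResponse(resp, true); dumpErr == nil {
			fmt.Printf("<<< response\n%s\n", b)
		}
	}
	return resp, err
}

func main() {
	client := &http.Client{Transport: dumpTransport{inner: http.DefaultTransport}}
	// Hypothetical target; substitute the failing blob URL from the logs above.
	if resp, err := client.Get("http://registry.local:80/v2/"); err == nil {
		resp.Body.Close()
	}
}
```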

Private Comment 11 jschorr@redhat.com 2019-04-08 17:59:26 UTC

We can provide a container image for running Quay; Ivan B (who reported the issue) can help with getting it set up with this flag enabled.

That being said, the 400 is coming from nginx fronting the call from Ceph, so there has to be some odd interaction going on there.

Private Comment 12 Ivan Bazulic 2019-04-10 12:53:39 UTC

I'm in the process of creating a Quay Ceph cluster on AWS as PoC. I'll update you when I'm done and will send you the info needed to log on.

Private Comment 13 Miloslav Trmač 2019-04-16 19:16:58 UTC

For the record: reproduced, skopeo fails with

GET /v2/admin/alpine/blobs/sha256:bdf0201b3a056acc4d6062cc88cd8a4ad5979983bfb640f15a145e09ed985f92 HTTP/1.1
Host:
User-Agent: Go-http-client/1.1
Authorization: Bearer
Docker-Distribution-Api-Version: registry/2.0
Accept-Encoding: gzip

HTTP/1.1 400 Bad Request
Connection: close
Content-Length: 188
Accept-Ranges: bytes
Content-Type: application/xml
Date: Tue, 16 Apr 2019 18:46:26 GMT
Server: nginx/1.13.12
X-Amz-Request-Id: tx000000000000000000053-005cb62302-1023-default

<?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidArgument</Code><RequestId>tx000000000000000000053-005cb62302-1023-default</RequestId><HostId>1023-default-default</HostId></Error>

Yet Docker succeeds:

GET /v2/admin/alpine/blobs/sha256:bdf0201b3a056acc4d6062cc88cd8a4ad5979983bfb640f15a145e09ed985f92 HTTP/1.1
Host:
User-Agent: docker/1.13.1 go/go1.10.7 kernel/4.19.15-200.fc28.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/1.13.1 (linux))
Accept-Encoding: identity
Authorization: Bearer
Connection: close

HTTP/1.1 302 FOUND
Server: nginx/1.13.12
Date: Tue, 16 Apr 2019 18:59:41 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 2821
Connection: close
Location: http:///_storage_proxy//http/client:7480/quay-bucket/datastorage/registry/sha256/bd/bdf0201b3a056acc4d6062cc88cd8a4ad5979983bfb640f15a145e09ed985f92?Signature=%3D&Expires=1555441781&AWSAccessKeyId=
Accept-Ranges: bytes
Docker-Content-Digest: sha256:bdf0201b3a056acc4d6062cc88cd8a4ad5979983bfb640f15a145e09ed985f92
Cache-Control: max-age=31536000
X-Frame-Options: DENY

That’s already curious, but changing skopeo to use exactly the same headers still fails:

GET /v2/admin/alpine/blobs/sha256:bdf0201b3a056acc4d6062cc88cd8a4ad5979983bfb640f15a145e09ed985f92 HTTP/1.1
Host:
User-Agent: docker/1.13.1 go/go1.10.7 kernel/4.19.15-200.fc28.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/1.13.1 (linux))
Accept-Encoding: identity
Authorization: Bearer
Connection: close

HTTP/1.1 400 Bad Request
Connection: close
Content-Length: 188
Accept-Ranges: bytes
Content-Type: application/xml
Date: Tue, 16 Apr 2019 19:05:55 GMT
Server: nginx/1.13.12
X-Amz-Request-Id: tx000000000000000000057-005cb62793-1023-default

<?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidArgument</Code><RequestId>tx000000000000000000057-005cb62793-1023-default</RequestId><HostId>1023-default-default</HostId></Error>

PnT Account Manager 2019-05-01 21:58:00 UTC CC: fkluknav@redhat.com

Private Comment 14 Pavel Vitous 2019-05-09 09:28:50 UTC

Hello,

sorry to hijack the thread, but I've experienced something very similar. I'll try to be brief. We have a setup with Quay Enterprise, using an S3 backend for storage (temporarily we use Zenko Cloudserver, but the same behavior was reproduced with minio) with proxy mode enabled.

To be clear, with the Docker client there aren't any issues; push and pull both work fine.

However, with Podman/Skopeo (libpod) we ran into problems when pulling images from Quay. I was getting 401 responses, while without proxy mode everything was fine. I spent a few days digging through tcpdump captures until I noticed that libpod behaves slightly differently from the Docker client. After the client receives an HTTP 302 for the storage request, the subsequent GET from the Docker client to the redirect address does not contain the "Authorization: Bearer ..." header, and the request succeeds. Libpod, however, keeps the Auth header from the previous request. This may sound legitimate, but S3 then can't handle the request and responds with 401. I couldn't test against AWS S3, so I'm not sure whether the behavior is the same there; we plan to use HCP S3 as a backend. If you'd like, I can send you the pcap captures for investigation. A simple workaround in our case was to strip the Auth headers in nginx at the proxy_storage location.

BR, Pavel

CC: vitous.pavel@ifortuna.cz Private Comment 15 Miloslav Trmač 2019-05-09 13:58:42 UTC

Pavel, thanks, that does sound interesting, you are right (Docker uses a custom http.RoundTripper to add credentials specific to a given host, for each request individually, while c/image specifies the credentials in the original request, so the credentials are used within redirects as well).
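A minimal client-side sketch of that difference (assumed behavior, not the actual docker or containers/image code): in Go, the CheckRedirect hook runs after the redirected request's headers have been copied, so it can drop Authorization even on same-host redirects such as Quay's /_storage_proxy/… location, which Go's automatic cross-host header stripping does not cover.

```go
package main

import "net/http"

// newRedirectSafeClient returns an http.Client that never forwards the
// Authorization header when following a redirect, mimicking on the client
// side what the nginx workaround mentioned above does on the server side.
func newRedirectSafeClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			req.Header.Del("Authorization")
			return nil
		},
	}
}

func main() {
	client := newRedirectSafeClient()
	// Hypothetical blob URL; the bearer token set on the initial request is
	// not re-sent to the storage endpoint after the 302.
	req, err := http.NewRequest("GET", "http://registry.local:80/v2/username/alpine/blobs/sha256:8e402f1a9c577ded051c1ef10e9fe4492890459522089959988a4852dee8ab2c", nil)
	if err != nil {
		return
	}
	req.Header.Set("Authorization", "Bearer <token>")
	if resp, err := client.Do(req); err == nil {
		resp.Body.Close()
	}
}
```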

Still, comment#13 shows a curious behavior difference already when getting the HTTP 302 redirect; the code does not even get to the point of using the resulting URL. It’s always possible that I made a mistake when collecting that data.

Private Comment 16 Pavel Vitous 2019-05-10 07:28:55 UTC

Hi Miloslav,

I forgot to mention one important fact: what I've described only happens if you're connecting through a load balancer (tested with HAProxy and nginx in TCP mode). When you go directly to the Quay endpoint (skipping the load balancer), it works like a charm.

So the connection goes: Client (pull) ----> LB (HAProxy) ----> Quay --(proxy_mode)--> S3.

Btw, I just tested with Ceph as the storage backend and, as far as I can tell, stripping the Auth header probably hotfixes the issue there as well.

This is what I get when I try pull without Auth header strip LB in front of quay: DEBU[0004] Manifest has MIME type application/vnd.docker.distribution.manifest.v1+prettyjws, ordered candidate list [application/vnd.docker.distribution.manifest.v1+prettyjws, application/vnd.docker.distribution.manifest.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v1+json] DEBU[0004] ... will first try using the original manifest unmodified DEBU[0004] Downloading /v2/test/minio/blobs/sha256:597d1508cd98cc549ab2200e992422a4d2a08eccf870eed959e48be82c1a016b DEBU[0004] GET http://lb-quay-registry.archer/v2/test/minio/blobs/sha256:597d1508cd98cc549ab2200e992422a4d2a08eccf870eed959e48be82c1a016b DEBU[0004] Downloading /v2/test/minio/blobs/sha256:b19d51ada28e91e7a5395b8ac310b927279dd3ba3d19656ec7e75d78107df1f8 DEBU[0004] GET http://lb-quay-registry.archer/v2/test/minio/blobs/sha256:b19d51ada28e91e7a5395b8ac310b927279dd3ba3d19656ec7e75d78107df1f8 ERRO[0004] Error pulling image ref //lb-quay-registry.archer/test/minio:latest: Error reading blob sha256:b19d51ada28e91e7a5395b8ac310b927279dd3ba3d19656ec7e75d78107df1f8: Invalid status code returned when fetching blob 400 (Bad Request) Failed

And here the same situation but with a workaround: DEBU[0004] Manifest has MIME type application/vnd.docker.distribution.manifest.v1+prettyjws, ordered candidate list [application/vnd.docker.distribution.manifest.v1+prettyjws, application/vnd.docker.distribution.manifest.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v1+json] DEBU[0004] ... will first try using the original manifest unmodified DEBU[0004] Downloading /v2/test/minio/blobs/sha256:597d1508cd98cc549ab2200e992422a4d2a08eccf870eed959e48be82c1a016b DEBU[0004] GET http://lb-quay-registry.archer/v2/test/minio/blobs/sha256:597d1508cd98cc549ab2200e992422a4d2a08eccf870eed959e48be82c1a016b DEBU[0004] Downloading /v2/test/minio/blobs/sha256:b19d51ada28e91e7a5395b8ac310b927279dd3ba3d19656ec7e75d78107df1f8 DEBU[0004] GET http://lb-quay-registry.archer/v2/test/minio/blobs/sha256:b19d51ada28e91e7a5395b8ac310b927279dd3ba3d19656ec7e75d78107df1f8 DEBU[0005] Detected compression format gzip
DEBU[0005] Using original blob without modification
Copying blob 597d1508cd98 [======================================] 1.5GiB / 1.5GiB DEBU[0005] Detected compression format gzip

Without LB it works in both situations.

Pavel

Private Comment 17 Valentin Rothberg 2019-07-18 08:27:46 UTC

Note that the initial issue in this bug (i.e., the display of an incorrect size) has been fixed upstream: https://github.com/containers/image/commit/2178abd5f9b1b9e70c0aca05780d1defba350a53

It will soon be vendored into Podman and be released.

If you want the other cross-posted issues addressed, please open a dedicated bug.

Private Comment 18 Daniel Walsh 2019-07-18 10:25:48 UTC

Ok I will mark this as fixed in podman-1.5

Fixed In Version: podman-1.5 Private Comment 19 Miloslav Trmač 2019-07-18 15:05:17 UTC

The primary issue is a 400 status returned, for a completely unknown reason. That has not been diagnosed to date.

(FWIW, the bearer authorization behavior difference reported in comment#14 is now tracked upstream in https://github.com/containers/image/issues/641 .)

Private Comment 20 Valentin Rothberg 2019-07-19 07:46:21 UTC

@Miloslav, good point. Would you mind looking into that? I have no expertise on the protocol level.

Assignee: vrothber@redhat.com → mitr@redhat.com Private Comment 21 Miloslav Trmač 2019-07-19 21:05:08 UTC

I can’t really think of any way to diagnose the cause of the 400 status, given that two people have observed different outputs with the same input. My best guess is that this needs to be diagnosed by tracing inside the Quay server. (A distinctly second guess is that this is actually caused by comment#14 , and we have both not noticed the difference.)

I guess, marking NEEDINFO for diagnosing the cause server-side.

Flags: needinfo?(ibazulic@redhat.com) Private Comment 22 Tom McKay 2019-07-22 17:24:31 UTC

I will take this on to test quay w/ both podman and skopeo. Skopeo will be used for repository mirroring and so confirming this works is a high priority.

Private Comment 23 Daniel Walsh 2019-08-14 10:35:29 UTC

Tom, I believe I can still mark this as fixed in podman 1.5, correct?

Private Comment 24 Ivan Bazulic 2019-10-23 12:27:18 UTC

It is not fixed. Podman version 1.6 still exhibits the same problem:

root@fedora:~# podman --log-level=debug pull --creds ibazulic:Budala123$ docker://debian.tardis.local/ibazulic/alpine:latest ... DEBU[0000] GET https://debian.tardis.local/v2/
DEBU[0000] Ping https://debian.tardis.local/v2/ err Get https://debian.tardis.local/v2/: dial tcp 192.168.20.100:443: connect: connection refused (&url.Error{Op:"Get", URL:"https://debian.tardis.local/v2/", Err:(*net.OpError)(0xc0003da0a0)}) DEBU[0000] GET http://debian.tardis.local/v2/
DEBU[0000] Ping http://debian.tardis.local/v2/ status 401 DEBU[0000] GET http://debian.tardis.local/v2/auth?account=ibazulic&scope=repository%3Aibazulic%2Falpine%3Apull&service=debian.tardis.local DEBU[0000] Increasing token expiration to: 60 seconds
DEBU[0000] GET http://debian.tardis.local/v2/ibazulic/alpine/manifests/latest DEBU[0000] Using blob info cache at /var/lib/containers/cache/blob-info-cache-v1.boltdb DEBU[0000] IsRunningImageAllowed for image docker:debian.tardis.local/ibazulic/alpine:latest DEBU[0000] Using default policy section
DEBU[0000] Requirement 0: allowed
DEBU[0000] Overall: allowed
DEBU[0000] Downloading /v2/ibazulic/alpine/blobs/sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4 DEBU[0000] GET http://debian.tardis.local/v2/ibazulic/alpine/blobs/sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4 DEBU[0000] Error pulling image ref //debian.tardis.local/ibazulic/alpine:latest: Error parsing image configuration: Invalid status code returned when fetching blob 400 (Bad Request) Invalid status code returned when fetching blob 400 (Bad Request) ERRO[0000] error pulling image "docker://debian.tardis.local/ibazulic/alpine:latest": unable to pull docker://debian.tardis.local/ibazulic/alpine:latest: unable to pull image: Error parsing image configuration: Invalid status code returned when fetching blob 400 (Bad Request)

Quay logs the following:

nginx stdout | 192.168.20.1 () - - [23/Oct/2019:12:12:50 +0000] "GET /_storage_proxy/eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjQyODQzYTEwYmJiZTk1NTJmMWFhNzRmMzE0YTU1OGUxODM5ZWQyOWNiZjQ4MWM3ZTNhNzk4NWRkMTRhNGVmZjkifQ.eyJhY2Nlc3MiOlt7Imhvc3QiOiJjZXBoLWNsaWVudDo3NDgwIiwic2NoZW1lIjoiaHR0cCIsInR5cGUiOiJzdG9yYWdlcHJveHkiLCJ1cmkiOiJteS1uZXctYnVja2V0L2RhdGFzdG9yYWdlL3JlZ2lzdHJ5L3NoYTI1Ni85Ni85NjE3Njk2NzY0MTFmMDgyNDYxZjllZjQ2NjI2ZGQ3YTJkMWUyYjJhMzhlNmE0NDM2NGJjYmVjZjUxZTY2ZGQ0P1NpZ25hdHVyZT1EemM0Q0lHODY2UnJIU3JTMkdta0I4Rmo4U3MlM0QmRXhwaXJlcz0xNTcxODMzMzcwJkFXU0FjY2Vzc0tleUlkPUs2RFROSDg1Mk5ZRENETlJTTVRGIn1dLCJjb250ZXh0Ijp7fSwiYXVkIjoiZGViaWFuLnRhcmRpcy5sb2NhbCIsImV4cCI6MTU3MTgzMjgwMCwiaXNzIjoicXVheSIsImlhdCI6MTU3MTgzMjc3MCwibmJmIjoxNTcxODMyNzcwLCJzdWIiOiJzdG9yYWdlcHJveHkifQ.e6-lAOUHZdcRVR8j04Q72H0oOQbStRJXIe-tLPsnvBPsjEDlWgw3bLFQrxeDm74B0LAfO7niF2naEsedIL3M3mwc5G97hIbnfMb_DOIojKyCvfdGTYLR8AHfi0kaNTdm0JAZraMosxGmPJ1WhokCC8ckmJfcnemmBT4Wv0YKFyscvXsHrRxH60RMW0_YUstZZ9JPYbKgh62KP-AY0C6tTKZj5xh6qDFvcoNGbN2Dlr8SQok0MVVBJqK2OK4nQeMRdMxxTrGzKj_Gc91tNoPf8vJotaQ3Q9_UdCiRhqziOLi2qtrFHli7QlVxnn8M0l6G6xJzeJMcsDSTPwddJlHcag/http/ceph-client:7480/my-new-bucket/datastorage/registry/sha256/96/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4?Signature=Dzc4CIG866RrHSrS2GmkB8Fj8Ss%3D&Expires=1571833370&AWSAccessKeyId=K6DTNH852NYDCDNRSMTF HTTP/1.1" 400 188 "http://debian.tardis.local/v2/ibazulic/alpine/blobs/sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4" "libpod/1.6.1" (0.043 2621 0.002)

Ceph RADOS logs the following:

2019-10-23 13:12:50.522295 7fc1f840b700 1 ====== starting new request req=0x7fc1f8405650 ===== 2019-10-23 13:12:50.522529 7fc1f840b700 1 ====== req done req=0x7fc1f8405650 op status=0 http_status=400 ====== 2019-10-23 13:12:50.522572 7fc1f840b700 1 civetweb: 0x55bf79804000: 192.168.20.100 - - [23/Oct/2019:13:12:50 +0100] "GET /my-new-bucket/datastorage/registry/sha256/96/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4 HTTP/1.0" 400 0 http://debian.tardis.local/v2/ibazulic/alpine/blobs/sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4 libpod/1.6.1

This is with FEATURE_PROXY_STORAGE: true. The same does not happen when the storage proxy feature is turned off; in that case, podman pulls work as expected. The same error happens when FEATURE_PROXY_STORAGE: true and the backing storage is AWS, so it's not tied to Ceph specifically. Again, the same does not happen when Docker is used. Proxied storage is only used for pulls; pushes work as intended.

Flags: needinfo?(ibazulic@redhat.com) Private Comment 25 Tom McKay 2019-10-29 14:54:34 UTC

It would be good to resolve this issue as it still fails on podman-1.6.2.

From our debugging it appears that podman is resending the wrong headers. We can provide a Quay setup to work with. The Docker client works successfully, as expected.

Flags: needinfo?(mitr@redhat.com) Private Comment 26 Daniel Walsh 2019-10-29 15:12:11 UTC

We need more information from the quay server.

Miloslav asked for this here. https://bugzilla.redhat.com/show_bug.cgi?id=1695236#c21

Private Comment 27 jschorr@redhat.com 2019-10-29 15:20:15 UTC

I'm nearly 100% certain that the root cause is that listed here: https://bugzilla.redhat.com/show_bug.cgi?id=1695236#c14

Can someone try modifying podman to turn off the "resend the Auth header" behavior and see if it fixes the problem?

Derrick Ornelas 2019-10-30 20:25:25 UTC CC: dornelas@redhat.com

Private Comment 28 Ben Cotton 2019-10-31 18:42:50 UTC

This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 29 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Ivan Bazulic 2019-11-08 17:15:06 UTC Version: 29 → 31
Derrick Ornelas 2019-11-19 23:38:59 UTC Blocks: Red Hat 1774279

Private Comment 29 svelosol@redhat.com 2019-12-04 08:52:33 UTC

This bug makes it impossible to use OpenShift 4 with Quay in proxy mode, as CRI-O uses the same library as Podman.

The workaround proposed by @Pavel works fine, but it should not have to be handled by Quay's nginx.

location ~ ^/_storage_proxy/([^/]+)/([^/]+)/([^/]+)/(.+) {
    ...
    # Fix Podman issue https://bugzilla.redhat.com/show_bug.cgi?id=1695236
    proxy_set_header Authorization "";
    ...
}

CC: svelosol@redhat.com Private Comment 30 jschorr@redhat.com 2019-12-04 16:27:12 UTC

@svelosol@redhat.com

This change has already been merged into the Quay code for the next release, as it was becoming an issue for our customers. You are correct, though, that this should ideally be handled on the podman side.

mtrmac commented 1 year ago

This is very likely a duplicate of #641, but determining that would require a careful re-read of both.