GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

gsutil cp -Z always force-adds `Cache-Control: no-transform` and `Content-Encoding: gzip`. Breaks HTTP protocol #480

Open nojvek opened 7 years ago

nojvek commented 7 years ago

The curl client should be able to receive the unzipped version, but GCS always returns Content-Encoding: gzip. This breaks the HTTP/1.1 protocol, since the client didn't send an "Accept-Encoding: gzip, deflate, br" header.

$ gsutil -m -h "Cache-Control: public,max-age=31536000" cp -Z foo.txt gs://somebucket/foo.txt
Copying file://foo.txt [Content-Type=text/plain]...
- [1/1 files][   42.0 B/   12.0 B] 100% Done
Operation completed over 1 objects/12.0 B.

$ curl -v somebucket.io/foo.txt
> GET /foo1.txt HTTP/1.1
> User-Agent: curl/7.37.0
> Host: somebucket.io
> Accept: */*
> 
< HTTP/1.1 200 OK
< X-GUploader-UploadID: ...
< Date: Thu, 19 Oct 2017 18:04:05 GMT
< Expires: Fri, 19 Oct 2018 18:04:05 GMT
< Last-Modified: Thu, 19 Oct 2017 18:03:47 GMT
< ETag: "c35fdf2f0c2dcadc46333b0709c87e64"
< x-goog-generation: 1508436227151587
< x-goog-metageneration: 1
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 42
< Content-Type: text/plain
< Content-Encoding: gzip
< x-goog-hash: crc32c=V/9tDw==
< x-goog-hash: md5=w1/fLwwtytxGMzsHCch+ZA==
< x-goog-storage-class: MULTI_REGIONAL
< Accept-Ranges: bytes
< Content-Length: 42
< Access-Control-Allow-Origin: *
* Server UploadServer is not blacklisted
< Server: UploadServer
< Age: 2681
< Cache-Control: public,max-age=31536000,no-transform
< [gzip-compressed binary body follows]

Seems to be happening here

https://github.com/GoogleCloudPlatform/gsutil/blob/e8154bab37ad896b1e1ab01f452ac3284c7051d4/gslib/copy_helper.py#L1741-L1759

nojvek commented 7 years ago

@houglum ^

houglum commented 7 years ago

This behavior (ignoring the Accept-Encoding header) is documented here: https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding

If the Cache-Control metadata field for the object is set to no-transform, the object is served as a compressed object in all subsequent requests, regardless of any Accept-Encoding request headers.

...although it also seems like it would be helpful for us to mention this (along with the fact that we apply the no-transform cache-control directive) in the docs for the -z option.
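
In the meantime, a quick way to confirm what metadata cp -z/-Z actually applied to an object is gsutil stat (a sketch; the bucket/object names below are placeholders):

gsutil stat gs://somebucket/foo.txt
# Among the stat output, look for:
#   Cache-Control:    ...no-transform   (added by cp -z/-Z)
#   Content-Encoding: gzip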

nojvek commented 7 years ago
If the request for the object includes an Accept-Encoding: gzip header, the object is served as-is in that specific request, along with a Content-Encoding: gzip response header.
If the Cache-Control metadata field for the object is set to no-transform, the object is served as a compressed object in all subsequent requests, regardless of any Accept-Encoding request headers.

Basically what I am saying is, there should be a way to turn off the no-transform that -Z/-z adds. It's too aggressive and breaks clients that don't understand gzip. I understand the no-transform is used for integrity checking, but the official gsutil client can always ask with Accept-Encoding: gzip and do an integrity check.

no-transform on -z seems like an implementation detail leaking out as a side effect. It essentially makes -z an unwise choice in a production environment, because it breaks the HTTP protocol between server and client.

houglum commented 7 years ago

+@thobrla for comment, as he added this in 439573e and likely has more context.

thobrla commented 7 years ago

If we remove no-transform, it's possible that integrity checking will be impossible for doubly compressed objects, since GCS may remove a layer of compression prior to sending the object even when Accept-Encoding:gzip is provided. This would in turn cause the stored MD5 not to match an MD5 computed on the received bytes regardless of the headers provided by the client.
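
To make that concrete with a quick local sketch (hypothetical file names, only to illustrate why the hashes diverge):

# report.csv.gz is already gzipped; uploading it with cp -z would gzip it a second time
gzip -c report.csv.gz > report.csv.gz.gz
md5sum report.csv.gz.gz   # hash of the doubly-compressed bytes that get uploaded and recorded
md5sum report.csv.gz      # hash of the bytes received if GCS strips one gzip layer on download
# The two hashes differ, so the client-side integrity check fails.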

So if we add such an option to drop no-transform, we're back in the situation we were in before https://github.com/GoogleCloudPlatform/gsutil/commit/439573e7266d8e309f9a1d0364fa91379e3a7b21 where certain files uploaded by gsutil cannot then be downloaded by gsutil, and this seems worse than not being downloadable by a different client.

To put it differently, I cannot see a way to author a fully compatible solution with GCS's current behavior.

As a workaround, you can remove cache-control: no-transform on such objects using the gsutil setmeta command. Would that work for your use case?
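
For example, something along these lines (an untested sketch; substitute your own object path and whatever caching policy you actually want):

gsutil setmeta -h "Cache-Control:public, max-age=3600" gs://somebucket/foo.txt
# Replacing the Cache-Control value drops the no-transform directive, so GCS can
# again transcode the object for clients that don't send Accept-Encoding: gzip.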

nojvek commented 7 years ago

GCS may remove a layer of compression prior to sending the object even when Accept-Encoding:gzip is provided.

GCS may remove? I'm not sure I fully comprehend. Isn't this behavior deterministic? If I upload gzipped and ask for gzipped, GCS should always give me gzipped, right?

It seems you are adding a no-transform header to the object as a marker that the object was uploaded gzipped, when what you actually want is a header that communicates that the MD5 hash is that of the compressed content and not the original content, e.g. a header like 'store-format: gzip' vs 'store-format: raw'. no-transform seems to be a hack that gets the job done, but with serious side effects.

Anyway, we'll try to work around it with a setmeta that removes the no-transform directive after doing the uploads.

I do hope that you won't close this as "won't fix" but work towards a better abstraction.

thobrla commented 7 years ago

Take a look at the second paragraph of Using gzip on compressed objects; if you upload gzipped and ask for gzipped, and GCS considers the content type to be incompressible, it will remove the encoding regardless of your request. It will then serve bytes that do not match the MD5 stored in the object's metadata. I think there is a core issue with the service in that GCS does not publish the content types it considers incompressible; as such, that list is also subject to change.

I agree there are serious side-effects to using no-transform as an approach; we decided on this as a compromise in gsutil because most modern clients can accept gzip encoding. I think unless that issue in the GCS service is addressed, we won't be able to arrive at a clean solution.

nojvek commented 7 years ago

So you're essentially saying GCS will tamper with my data when storing it, based on an undocumented process that even some of the Google Cloud team doesn't know about.

Do you know where I can file an issue for the root cause? This seems like bad design on so many levels. I would expect GCS to just be a dumb store of bytes and follow the Content-Encoding: gzip HTTP spec.

lotten commented 7 years ago

Just to be clear, GCS will never touch the stored data, this is exclusively about the encoding when sending it over the wire.

mikeheme commented 6 years ago

@thobrla For my use case, the behavior of always getting gzipped content when the object's Cache-Control is set to 'no-transform' works fine for optimized web serving.

However, I did some extra tinkering and removed "no-transform" from Cache-Control as you suggested, but that causes another issue. The expected behavior should be that the server respects the "Accept-Encoding" header: if 'gzip' is included, it should return gzipped content (no decompressive transcoding, serve the file as stored); if no "Accept-Encoding: gzip" request header is included, it should DO decompressive transcoding as documented (right?); and in both cases it should respond with the corresponding "Content-Encoding" header. BUT it appears that GCS ignores the "Accept-Encoding" request header and always does decompressive transcoding.

If the request for the object includes an Accept-Encoding: gzip header, the object is served as-is in that specific request, along with a Content-Encoding: gzip response header.

For example, the following file has the following metadata in GCS: https://storage.googleapis.com/cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg

Content-Type: image/jpeg
Content-Encoding: gzip
Cache-Control:

When requesting the file with the request header 'Accept-Encoding: gzip', the server doesn't respond with a "Content-Encoding: gzip" header and the image is NOT compressed/gzipped; it therefore forces decompressive transcoding incorrectly. Notice the header 'Warning: 214 UploadServer gunzipped', which I suppose is how Google informs clients that it actually did the decompression.

With -H “Accept-Encoding: gzip” :

# removed unnecessary lines to save space
curl -v -H 'Accept-Encoding: gzip' "https://storage.googleapis.com/cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg" > should_be_compressed_image.jpg

> GET /cedar-league-184821.appspot.com/1/2017/11c/FullSizeRender-1.jpg HTTP/1.1
> Host: storage.googleapis.com
> User-Agent: curl/7.54.0
> Accept: */*
> Accept-encoding: gzip
>
< HTTP/1.1 200 OK
< X-GUploader-UploadID: AEnB2UpLqw1hkA5MngrLr70nY3nBRZTAmG_432r5LaRipy7nKN4vVzoWlCSoW2220v1tER_10RQ-jMFF7h3tndchkWwVVT46nA
< x-goog-generation: 1510118544142300
< x-goog-metageneration: 2
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 315737
< Content-Type: image/jpeg
< Content-Language: en
< x-goog-hash: crc32c=/tU2vQ==
< x-goog-hash: md5=BkUrq+p+go4s4q1dvK4O3w==
< x-goog-storage-class: STANDARD
< Warning: 214 UploadServer gunzipped
< Content-Length: 319480
< Server: UploadServer
< Cache-Control: public, max-age=3600
< Age: 1808

Without -H “Accept-Encoding: gzip” :

curl -v "https://storage.googleapis.com/cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg" > should_be_decompressed_image.jpg
> GET /cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg HTTP/1.1
> Host: storage.googleapis.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< X-GUploader-UploadID: AEnB2Uq6Ztc3zslZqP8uoBlZPzIS1l92hGTfJjdwIhP3t5V1j2ll6sFRhj3vlqVndnlgcKZM82fjskm1tNWd5N9i1V1E-qsk6sswtoqxB8V0_PJ_lA11fB4
< x-goog-generation: 1510130246520360
< x-goog-metageneration: 1
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 315737
< Content-Type: image/jpeg
< Content-Language: en
< x-goog-hash: crc32c=/tU2vQ==
< x-goog-hash: md5=BkUrq+p+go4s4q1dvK4O3w==
< x-goog-storage-class: STANDARD
< Warning: 214 UploadServer gunzipped
< Content-Length: 319480
< Server: UploadServer
< Cache-Control: public, max-age=3600
< Age: 1056
thobrla commented 6 years ago

Thanks for the detailed reproduction of the issue. I'm discussing with the GCS team internally, and we may be able to stop the service from removing a layer of compression when Accept-Encoding: gzip is present. If the fix works, it will remove the need for gsutil to add Cache-Control:no-transform.

I'll let you know when I have more details.

nojvek commented 6 years ago

Thanks @thobrla. Really appreciate this getting fixed at the root cause.

mikeheme commented 6 years ago

awesome @thobrla! thanks!

mikeheme commented 6 years ago

@thobrla any updates on this?

thobrla commented 6 years ago

Update: the work to stop the GCS service from unnecessarily removing a layer of compression is understood, but it is a larger effort than the GCS team originally thought. Part of that work is complete, but finishing the remainder isn't on the team's priorities for the near future.

Until that changes, we'll have to live with this behavior in clients. I think the Cache-Control behavior of gsutil is the best default (given that it can be disabled with setmeta if necessary).

Leaving this issue open to track fixing this if the GCS service implements the fix.

nojvek commented 6 years ago

Not sure if the GCS team also has a GitHub repo or a public bug tracker. If possible, it would be nice to have a link to the underlying tracking bug.

Thanks for the follow up though @thobrla

sdkks commented 6 years ago

When I remove the header with 'setmeta' by using only -h "Cache-Control", on the client side I'm seeing that Google's CDN (which uses this backend bucket) also sends the same header with a null value. By default, if the header is not set, I used to see 'public, max-age=3600'. I'm guessing we need to track this with the GCS team, too...

dalbani commented 6 years ago

Hi, I discovered this bug report by way of https://github.com/GoogleCloudPlatform/google-cloud-python/issues/4227 and https://github.com/GoogleCloudPlatform/google-resumable-media-python/issues/34. I was getting this strange "Checksum mismatch while downloading" message when downloading GCS blobs using the official Python library. (Although that issue is supposed to be fixed, it still doesn't work for me, by the way.) But regardless of the Python-specific issue, I am curious what you think of the following request logs:

Retrieving a blob with the Python library:

> GET /download/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd?alt=media HTTP/1.1
> Host: www.googleapis.com
> Connection: keep-alive
> accept-encoding: gzip
> Accept: */*
> User-Agent: python-requests/2.18.4
> authorization: Bearer xxx
< HTTP/1.1 200 OK
< X-GUploader-UploadID: xxx
< Content-Type: image/jpeg
< Content-Disposition: attachment
< ETag: W/COmetKubk9gCEAE=
< Vary: Origin
< Vary: X-Origin
< X-Goog-Generation: 1513588173639529
< X-Goog-Hash: crc32c=fkoHfw==,md5=Lbe8pGpkq2fctqveModTlw==
< X-Goog-Metageneration: 1
< X-Goog-Storage-Class: REGIONAL
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< Pragma: no-cache
< Expires: Mon, 01 Jan 1990 00:00:00 GMT
< Date: Sat, 06 Jan 2018 20:02:10 GMT
< Warning: 214 UploadServer gunzipped
< Content-Length: 368869
< Server: UploadServer
< Alt-Svc: hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35"

See that Warning: 214 UploadServer gunzipped header in the response. But the problem here is that the blob was specifically uploaded with Cache-Control: no-transform. Here are the details of the blob:

{
  "kind": "storage#object", 
  "contentType": "image/jpeg", 
  "name": "binary/00da00d2ddc203a245753a8c1276c0d398341abd", 
  "timeCreated": "2017-12-18T09:09:33.635Z", 
  "generation": "1513588173639529", 
  "md5Hash": "Lbe8pGpkq2fctqveModTlw==", 
  "bucket": "xxx", 
  "updated": "2017-12-18T09:09:33.635Z", 
  "contentEncoding": "gzip", 
  "crc32c": "fkoHfw==", 
  "metageneration": "1", 
  "mediaLink": "https://www.googleapis.com/download/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd?generation=1513588173639529&alt=media", 
  "storageClass": "REGIONAL", 
  "timeStorageClassUpdated": "2017-12-18T09:09:33.635Z", 
  "cacheControl": "no-transform", 
  "etag": "COmetKubk9gCEAE=", 
  "id": "xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd/1513588173639529", 
  "selfLink": "https://www.googleapis.com/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd", 
  "size": "368849"
}

And, sure enough, retrieving the blob using the public URL works as expected according to the documentation:

$ curl -v -O https://storage.googleapis.com/xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd
> GET /xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd HTTP/1.1
> Host: storage.googleapis.com
> User-Agent: curl/7.47.0
> Accept: */*
< HTTP/1.1 200 OK
< X-GUploader-UploadID: xxx
< Date: Sat, 06 Jan 2018 20:03:39 GMT
< Cache-Control: no-transform
< Expires: Sun, 06 Jan 2019 20:03:39 GMT
< Last-Modified: Mon, 18 Dec 2017 09:09:33 GMT
< ETag: "2db7bca46a64ab67dcb6abde32875397"
< x-goog-generation: 1513588173639529
< x-goog-metageneration: 2
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 368849
< Content-Type: image/jpeg
< Content-Encoding: gzip
< x-goog-hash: crc32c=fkoHfw==
< x-goog-hash: md5=Lbe8pGpkq2fctqveModTlw==
< x-goog-storage-class: REGIONAL
< Accept-Ranges: bytes
< Server: UploadServer
< Alt-Svc: hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35"
< Transfer-Encoding: chunked

Out of curiosity, I tried playing with the Accept-Encoding header, but that made no difference. Setting an Accept-Encoding: gzip header in the request to the public URL returns the same expected result. And when disabling the Accept-Encoding: gzip header in the code of the Python library, "www.googleapis.com/download/storage/..." insists on returning decompressed content.

So is there some UploadServer black magic going on here when making a request via "www.googleapis.com/download/storage/..."?!

dalbani commented 6 years ago

Black magic seems to be the appropriate term, because it looks like the content type of the uploaded blob has an effect on this unexpected decompression. I could, for example, determine that certain content types trigger the "bug".

Could someone from Google comment on that? Thanks.

thobrla commented 6 years ago

@dalbani : see the documentation at https://cloud.google.com/storage/docs/transcoding#gzip-gzip on Google Cloud Storage's current behavior regarding compressible content-types. Per my comments above, the work to stop GCS from removing a layer of compression isn't currently prioritized.

nojvek commented 6 years ago

I believe it's not about "removing a layer of compression". It's about making compression deterministic so that it behaves well with web clients.

dalbani commented 6 years ago

@thobrla: thanks for your response, but I have already had a look at that documentation, especially where it says that the Cache-Control: no-transform header should force GCS to never gunzip the data. Yet it obviously does gunzip it, as shown in my logs above, doesn't it? Recap: GCS doesn't behave as documented for a particular HTTP endpoint, as far as I could test.

thobrla commented 6 years ago

@dalbani Thanks for the report. I can't reproduce this issue, though - I tried out your scenario with a Content-Type: image/jpeg, Content-Encoding: gzip, Cache-Control: no-transform object and did not see an unzipped response. Can you construct curl requests (with auth headers and bucket name omitted) that create an object that reproduces this issue?

nojvek commented 6 years ago

I am seeing buggy behaviour too, where setmeta Cache-Control overrides the gzipping functionality.

gsutil cp -Z foo.min.js gs://cdn-bucket/foo.min.js

accept-ranges:bytes
access-control-allow-origin:*
alt-svc:clear
cache-control:no-transform <---- undesired
content-encoding:gzip <----- correct
content-language:en
content-length:7074
content-type:application/javascript
date:Wed, 21 Feb 2018 01:21:27 GMT

After gsutil setmeta -h "Cache-Control: public,max-age=31536000" gs://cdn-bucket/foo.min.js

accept-ranges:bytes
access-control-allow-origin:*
age:127807
alt-svc:clear
cache-control:public,max-age=31536000
content-language:en
content-length:31684 <------- No content encoding gzip :(
content-type:text/css
date:Mon, 19 Feb 2018 14:01:02 GMT
etag:"691cfcaa0eb97e1f3c7d4b1687b37834"
expires:Tue, 19 Feb 2019 14:01:02 GMT
last-modified:Tue, 24 Oct 2017 00:48:44 GMT
server:UploadServer
status:200

So @thobrla it seems your recommendation of setmeta afterwards does not work.
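
To narrow down what changed, the object itself can be checked after the setmeta (a sketch, using the same object path as above):

gsutil stat gs://cdn-bucket/foo.min.js
# If Content-Encoding: gzip is still listed, the object metadata is intact and the second
# response above is decompressive transcoding (expected once no-transform is gone);
# if it's missing, setmeta really did clear the encoding.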

dalbani commented 6 years ago

@thobrla Any library should be able to create a "problematic" blob but I've created an all-in-one script to show the behaviour I was talking about: https://gist.github.com/dalbani/ae837a0f00b395f875c74646eda5bfac. It shows the difference between retrieving a blob via https://www.googleapis.com/download/storage/... and storage.googleapis.com, including the strange effect of some content types.

TL;DR: the so-called "UploadServer" treats some content types differently than others when downloading resources via https://www.googleapis.com/download/storage/....

For example, let's say I run the script with an empty, gzip'ed 32x32 JPEG file:

$ convert -size 32x32 xc:white /tmp/32x32.jpg
$ cat /tmp/32x32.jpg | gzip -9 > /tmp/32x32.jpg.gz
$ ls -l /tmp/32x32.jpg*
-rw-rw-r-- 1 me me 165 Mar  2 23:08 /tmp/32x32.jpg
-rw-rw-r-- 1 me me 137 Mar  2 23:08 /tmp/32x32.jpg.gz

If I run my script with ./test-gcs.sh 32x32.jpg /tmp/32x32.jpg.gz image/jpeg, I get the following output:

{
  "kind": "storage#object",
  "id": "xyz/32x32.jpg/1520033236514750",
  "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg",
  "name": "32x32.jpg",
  "bucket": "xyz",
  "generation": "1520033236514750",
  "metageneration": "1",
  "contentType": "image/jpeg",
  "timeCreated": "2018-03-02T23:27:16.513Z",
  "updated": "2018-03-02T23:27:16.513Z",
  "storageClass": "REGIONAL",
  "timeStorageClassUpdated": "2018-03-02T23:27:16.513Z",
  "size": "137",
  "md5Hash": "rV9N/0RX6QgkCjpDIi2Lyw==",
  "mediaLink": "https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033236514750&alt=media",
  "contentEncoding": "gzip",
  "cacheControl": "no-transform",
  "acl": [
    ...,
    {
      "kind": "storage#objectAccessControl",
      "id": "xyz/32x32.jpg/1520033236514750/allUsers",
      "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg/acl/allUsers",
      "bucket": "xyz",
      "object": "32x32.jpg",
      "generation": "1520033236514750",
      "entity": "allUsers",
      "role": "READER",
      "etag": "CL7v74jlztkCEAE="
    }
  ],
  "owner": {
    "entity": "..."
  },
  "crc32c": "ttHcwA==",
  "etag": "CL7v74jlztkCEAE="
}

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033236514750&alt=media <== (No "Accept-Encoding" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Uq-q5SJGTO2-LXhvc-e9LX6AVG-i4TAtNVBs9oI-OtywZ-oyUGrb_EHAT8qUbXC6lDKB9NR-1Oy_odHur7Ndx6Kq45XDg
Content-Type: image/jpeg
Content-Disposition: attachment
ETag: W/CL7v74jlztkCEAE=
Vary: Origin
Vary: X-Origin
X-Goog-Generation: 1520033236514750
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:27:16 GMT
Warning: 214 UploadServer gunzipped
Content-Length: 165
Server: UploadServer

165

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033236514750&alt=media <== ("Accept-Encoding: gzip" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UqElToltlztZmVJ5kHTg7-MHRNwxHru-o1ta1kfylxKEQ66zZ8JU36gsz0nqgA8Jrmx86B7MJpUJ1EjVsfIWHOve-3Q4w
Content-Type: image/jpeg
Content-Disposition: attachment
Vary: X-Origin
X-Goog-Generation: 1520033236514750
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:27:16 GMT
Server: UploadServer
Accept-Ranges: none
Vary: Origin,Accept-Encoding
Transfer-Encoding: chunked

165

==> https://storage.googleapis.com/xyz/32x32.jpg <==
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Ur42wt38EuxIxQo6FPFRMCYav2YPQpvxw1GFE-6vq4jpuQNbgU1r5vXN_JjzYzoRCwNgtwldZUF9JenNyYPE0oDaW9_Vg
Date: Fri, 02 Mar 2018 23:27:16 GMT
Cache-Control: no-transform
Expires: Sat, 02 Mar 2019 23:27:16 GMT
Last-Modified: Fri, 02 Mar 2018 23:27:16 GMT
ETag: "ad5f4dff4457e908240a3a43222d8bcb"
x-goog-generation: 1520033236514750
x-goog-metageneration: 1
x-goog-stored-content-encoding: gzip
x-goog-stored-content-length: 137
Content-Type: image/jpeg
Content-Encoding: gzip
x-goog-hash: crc32c=ttHcwA==
x-goog-hash: md5=rV9N/0RX6QgkCjpDIi2Lyw==
x-goog-storage-class: REGIONAL
Accept-Ranges: bytes
Server: UploadServer
Transfer-Encoding: chunked

137

See that both requests to https://www.googleapis.com/download/... return gunzip'ed data, without mentioning the encoding (and with Warning: 214 UploadServer gunzipped only when no Accept-Encoding: gzip header was present in the request?!). Summary: the responses are respectively 165, 165 and 137 bytes.

This transparent gunzip causes problems with, for example, the Python library, which sends an Accept-Encoding: gzip header and thus falls into case number 2.

...
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:

  https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?alt=media

The X-Goog-Hash header indicated an MD5 checksum of:

  rV9N/0RX6QgkCjpDIi2Lyw==

but the actual MD5 checksum of the downloaded contents was:

  Dpx7jzPpJiEyPwovSJL/fA==

Now, let's compare with the output of the same command but with a different content type, e.g. ./test-gcs.sh 32x32.jpg /tmp/32x32.jpg.gz image/xyz:

{
  "kind": "storage#object",
  "id": "xyz/32x32.jpg/1520033806814916",
  "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg",
  "name": "32x32.jpg",
  "bucket": "xyz",
  "generation": "1520033806814916",
  "metageneration": "1",
  "contentType": "image/xyz",
  "timeCreated": "2018-03-02T23:36:46.813Z",
  "updated": "2018-03-02T23:36:46.813Z",
  "storageClass": "REGIONAL",
  "timeStorageClassUpdated": "2018-03-02T23:36:46.813Z",
  "size": "137",
  "md5Hash": "rV9N/0RX6QgkCjpDIi2Lyw==",
  "mediaLink": "https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033806814916&alt=media",
  "contentEncoding": "gzip",
  "cacheControl": "no-transform",
  "acl": [
    ...
    {
      "kind": "storage#objectAccessControl",
      "id": "xyz/32x32.jpg/1520033806814916/allUsers",
      "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg/acl/allUsers",
      "bucket": "xyz",
      "object": "32x32.jpg",
      "generation": "1520033806814916",
      "entity": "allUsers",
      "role": "READER",
      "etag": "CMSd6JjnztkCEAE="
    }
  ],
  "owner": {
    "entity": "..."
  },
  "crc32c": "ttHcwA==",
  "etag": "CMSd6JjnztkCEAE="
}

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033806814916&alt=media <== (No "Accept-Encoding" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Uq8B_5VWTlAYbs6pPWsBb2Ap7DJ2gyVi_ZUWA6noZ0m7dflv9hn1siBbwQGRkOk0g6CMw_j1eOlRmzpJoBylLX-FupKEA
Content-Type: image/xyz
Content-Disposition: attachment
ETag: W/CMSd6JjnztkCEAE=
Vary: Origin
Vary: X-Origin
X-Goog-Generation: 1520033806814916
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:36:46 GMT
Warning: 214 UploadServer gunzipped
Content-Length: 165
Server: UploadServer

165

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033806814916&alt=media <== ("Accept-Encoding: gzip" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Uq63b0MPQcdUOVj4GxXRJqlXifJTg_6xhUjZe8KVKb6hsRxXGo1VbmIUraY2EjQ6WpMtdhJysQE8AyorbF_QkelHoGcx6wq4vsyX9WNBlPTGoqisMY
Content-Type: image/xyz
Content-Disposition: attachment
Content-Encoding: gzip
ETag: CMSd6JjnztkCEAE=
Vary: Origin
Vary: X-Origin
X-Goog-Generation: 1520033806814916
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:36:46 GMT
Server: UploadServer
Transfer-Encoding: chunked

137

== https://storage.googleapis.com/xyz/32x32.jpg ==
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UqIGXYH1gXIkhJAYRHfGjzqYExKxvN3MunX9IE9iZdJDq0Uwkn-CFRxwXjsccCuFKwO5izsEdtEc2R3u5DzrrRVg2ikJA
Date: Fri, 02 Mar 2018 23:36:47 GMT
Cache-Control: no-transform
Expires: Sat, 02 Mar 2019 23:36:47 GMT
Last-Modified: Fri, 02 Mar 2018 23:36:46 GMT
ETag: "ad5f4dff4457e908240a3a43222d8bcb"
x-goog-generation: 1520033806814916
x-goog-metageneration: 1
x-goog-stored-content-encoding: gzip
x-goog-stored-content-length: 137
Content-Type: image/xyz
Content-Encoding: gzip
x-goog-hash: crc32c=ttHcwA==
x-goog-hash: md5=rV9N/0RX6QgkCjpDIi2Lyw==
x-goog-storage-class: REGIONAL
Accept-Ranges: bytes
Server: UploadServer
Transfer-Encoding: chunked

137

Summary: the responses are respectively 165, 137 (not 165 as above!) and 137 bytes. In the 2nd request, no automatic gunzip has taken place in "UploadServer", only because of the content type?!

And here the Python library has no problem downloading the blob and checking the MD5 checksum.

Although this post is very long, I hope it is clear enough that you can pinpoint and eventually fix the issue. Thanks for your attention!

thobrla commented 6 years ago

Thanks @dalbani - I think at this point the problem is well understood and we're waiting for the Cloud Storage team to prioritize a fix (but to my knowledge it's not currently prioritized).

yonran commented 6 years ago

The Object Transcoding documentation and gsutil cp documentation should probably be modified to indicate that gsutil cp -z disables decompressive transcoding.

acoulton commented 4 years ago

I would go further than @yonran and say the documentation should definitely be modified: this is a really frustrating omission. Also, for publishing static web assets, it's really frustrating to have no flag/option to disable this behaviour. I have no need to ever download these files again with gsutil, so the checksum issue doesn't affect me - I just want to gzip them on the way up and have GCS then serve them to clients in line with the documentation.

starsandskies commented 3 years ago

Closing this issue - the documentation is now updated on cloud.google.com, and I'm backfilling the source files here in GitHub to match.

acoulton commented 3 years ago

@starsandskies great that the docs have been updated - thanks for that. I'm not sure it's valid to close this issue though.

When uploading more than a few files - e.g. for web assets / static sites - it is extremely inefficient to have to run gsutil -m cp -z -r $dir gs://$bucket/$path and then a separate gsutil -m setmeta -r -h "Cache-Control:public, max-age=.." gs://$bucket/$path afterwards to fix the cache header.

That adds a fair time overhead, and more importantly creates the risk of objects existing in the bucket with inconsistent / unexpected state if the second setmeta command fails for any reason.

If we specify gsutil cp -r -z -h "Cache-Control:public, max-age=..." then at the very least gsutil should emit a runtime warning that our explicit -h value has been ignored / overwritten. But it would be much better if gsutil respected an explicit command-line value in preference to the default. Or, if that's really not possible for backwards-compatibility reasons, an explicit flag to disable this behaviour.

FWIW, although the docs are now clearer, I think it's still not obvious that at present the Cache-Control:no-transform completely overwrites any Cache-Control header set on the command line.

starsandskies commented 3 years ago

I definitely think there are improvements to the tool that could be made (and, fwiw, the push to fix the underlying behavior that necessitates the -z behavior saw renewed interest at the end of 2020). I've no objection to re-opening this (I assume you're able to, though let me know if not - I'm not a GitHub expert by any stretch of the imagination), but this thread has gotten quite long and meandering. I'd recommend taking the relevant points and making a fresh issue that cuts away the excess.

acoulton commented 3 years ago

@starsandskies thanks for the response - no, I can't reopen; only core contributors/admins can reopen issues on GitHub.

I couldn't see a branch / pull request relevant to the underlying behaviour that necessitates the -z behaviour. Do you mean server-side on GCS, as mentioned up the thread, or is there an issue / pull request open for that elsewhere that I could reference / add to?

I'm happy to make a new issue, though I think the issue description and first couple of comments here (e.g. https://github.com/GoogleCloudPlatform/gsutil/issues/480#issuecomment-338050378) capture the problem, and IMO there's an advantage to keeping this issue alive since there are already people watching it.

But if you'd prefer a new issue I'll open one and reference this.

starsandskies commented 3 years ago

Ah, in that case, I'll reopen this.

To answer your question, my understanding is that what blocks a true fix is on the server side, and that it affects other tools as well, such as the client libraries (see, for example, https://github.com/googleapis/nodejs-storage/issues/709).

acoulton commented 3 years ago

@starsandskies thanks :)

Yes, I see the problem on that nodejs-storage issue. I think, though, it breaks into two use cases:

1. Downloading gzip-uploaded objects again with gsutil / the client libraries, where the integrity check fails unless GCS serves exactly the stored bytes.
2. Uploading gzipped static assets that GCS then serves to web clients / a CDN.

AFAICS the second use case was working without any problems until the gsutil behaviour was changed to fix the first one.

The key thing is that it's obviously still valid to have gzipped files in the bucket with decompressive transcoding enabled - nothing stops you setting your own Cache-Control header after the initial upload. That obviously fixes use case 2 but breaks use case 1. That being the case, I don't think there's any good reason why gsutil should silently prevent you from doing it in a single call, even if you want to keep the default behaviour as it is now.

MrTrustworthy commented 3 years ago

Since we just stumbled upon this issue when trying to move to GCP/GCS for our CDN assets, and this thread was very helpful in figuring out why, I wanted to leave a piece of feedback from the user side for this topic.

There are many responses (I assume from maintainers/developers of GCP/gsutil) suggesting that adding the no-transform setting by default, with no way to disable it, is the best possible option. Example:

So if we add such an option to drop no-transform, we're back in the situation we were in before 439573e where certain files uploaded by gsutil cannot then be downloaded by gsutil, and this seems worse than not being downloadable by a different client.

I just want to say that, as a user of GCP, I harshly disagree with that assessment.

From my perspective, the actual state of things is that gsutil is simply bugged and won't allow you to download files if they are uploaded via cp -z/Z. This is not nice, but acceptable - tools have bugs sometimes, and they need to be prioritised and fixed. But instead, the cp behaviour was modified to break CDN users in a way that's very hard to detect in the first place.

For an outside user of GCP, it seems like the respective team isn't interested in fixing the bug, so it's hiding the issue behind a different and harder-to-notice issue, just so it's technically not broken on their end per their definition. As a user of GCP, I don't care whether gsutil technically works correctly, I care whether my entire GCP setup works correctly - and it currently doesn't.

To be clear: the default behaviour of gsutil cp -z/Z, when used for CDN purposes (which is probably the main reason people use -z/Z in the first place), is to silently break the HTTP spec. After uploading our assets, our pages suddenly delivered compressed assets even to clients that didn't support them. This is simply wrong. If the CDN were automatically configured to send the correct (406) response in those cases, it would be somewhat acceptable - but it isn't.

In my personal view, gsutil should simply be allowed to break when trying to download such compressed files, maybe for now with a nice error message explaining why. If that download bug is severe enough, then a fix should be prioritised accordingly. But silently breaking the HTTP spec for CDN users to hide the download bug is not acceptable IMHO.
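
(For anyone else hit by this, a quick way to check whether a URL serves gzip to clients that never asked for it - the URL below is a placeholder, adjust to your own CDN/bucket:)

curl -sI https://cdn.example.com/assets/app.js | grep -i content-encoding
curl -sI -H 'Accept-Encoding: gzip' https://cdn.example.com/assets/app.js | grep -i content-encoding
# If the first command already shows "content-encoding: gzip", clients that don't
# support gzip are being served compressed bytes.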

frankyn commented 2 years ago

Short update: as of the end of this week, the Cloud Storage API will always respect Accept-Encoding: gzip. The underlying issue was that GCS would decompress data even when decompression was not requested.

The change was rolled back, so we will need to follow up again when we have an update. Apologies, I jinxed it.