Figure out story for object content encodings

jacobsa commented 9 years ago

GCS objects have a contentEncoding property, sort of but not really documented here. That page implies that maybe it is always echoed as Content-Encoding when serving a read for the object, but it's not clear. This page says that it's intended to work with a value of gzip, and sort of implies by omission that it's not intended to work with other encodings. This page has slightly more detail about motivations and behavior.

Throw into the mix the fact that Go's http.Transport automatically sets Accept-Encoding: gzip on requests if no other Accept-Encoding is set (cf. Transport.DisableCompression), then transparently decompresses if it gets Content-Encoding: gzip back, and this starts to get confusing.

To do:

Figure out what our current behavior is for objects with and without contentEncoding set, for values gzip and otherwise.
Don't forget to test reading sub-ranges of such objects. What happens?
Figure out what our behavior should be and document it in semantics.md.
Add integration tests and make sure behavior matches the documentation.
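
One detail worth keeping in mind for the sub-range item: as far as I can tell from the net/http source, the transport only adds its own Accept-Encoding: gzip when the request carries no Range header. A small sketch that checks this against a local echo server (not GCS; acceptEncodingSeenByServer is a hypothetical helper, and the server ignores the Range value entirely):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// acceptEncodingSeenByServer issues a GET through a default-configured
// http.Transport and returns the Accept-Encoding header that the server
// actually received. If rangeHeader is non-empty it is sent as Range.
func acceptEncodingSeenByServer(rangeHeader string) (string, error) {
	// The server echoes the request's Accept-Encoding header as the body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, r.Header.Get("Accept-Encoding"))
	}))
	defer srv.Close()

	req, err := http.NewRequest("GET", srv.URL, nil)
	if err != nil {
		return "", err
	}
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}

	client := &http.Client{Transport: &http.Transport{}}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	full, err := acceptEncodingSeenByServer("")
	fmt.Printf("full read:      Accept-Encoding=%q err=%v\n", full, err)

	ranged, err := acceptEncodingSeenByServer("bytes=0-3")
	fmt.Printf("sub-range read: Accept-Encoding=%q err=%v\n", ranged, err)
}
```

So full reads and sub-range reads may already be going out with different headers, and the integration tests should cover both.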

(Thanks to Jurek Papiorek for raising this issue.)

jacobsa commented 9 years ago

Don't forget:

Integration tests for the behavior of object composition.
Integration tests involving storage of actual .gz files.

jacobsa commented 9 years ago

This is made more difficult by Google-internal bug 24347854 (which I just discovered): if you upload invalid gzip content and then go to read it back, you always get HTTP 503 no matter what you set for Accept-Encoding.
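
For anyone writing tests around this, a client-side sanity check before labeling an object as gzip might look like the sketch below. isValidGzip and gzipBytes are hypothetical helpers, and passing this check says nothing about whether GCS itself will accept the content.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes returns the gzip encoding of p.
func gzipBytes(p []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(p); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// isValidGzip reports whether p decodes cleanly as a gzip stream according
// to Go's compress/gzip. It reads the whole stream so that truncation and
// checksum errors are caught, not just a bad header.
func isValidGzip(p []byte) bool {
	zr, err := gzip.NewReader(bytes.NewReader(p))
	if err != nil {
		return false
	}
	defer zr.Close()
	_, err = io.Copy(io.Discard, zr)
	return err == nil
}

func main() {
	good := gzipBytes([]byte("some object contents"))
	fmt.Println("real gzip:     ", isValidGzip(good))
	fmt.Println("plain text:    ", isValidGzip([]byte("not gzip at all")))
	fmt.Println("truncated gzip:", isValidGzip(good[:len(good)-4]))
}
```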

jacobsa commented 9 years ago

Filed Google-internal bug 24347482 for the underspecified documentation on what GCS is expected to do in a bunch of cases.

jacobsa commented 9 years ago

I've come to the conclusion that contentEncoding shouldn't/can't be supported by gcsfuse in any specific way. Rather, we should treat this like versioned buckets and explicitly say the behavior is undefined when you use such objects with gcsfuse, and advise against doing so.

Brain dump about how the contentEncoding feature is problematic:

It is bug-prone: if you claim that content is gzip when it is not, GCS will serve an HTTP 503 when you go to read it. (See Google-internal bug 24347854.) I found this in my first five minutes with the feature, which makes me think there may be numerous other bugs lurking.
The previous point is made worse by the fact that GCS treats some valid gzip content as invalid, serving a 503. (See Google bug 24693623.)
The feature interacts poorly with the rest of the GCS API. It appears to be intended to support what I'll call "the CDN case": serving media to browsers that will take the gzip-encoded content and decode it to what the user wants to see. That works fine, but it works less well when you're using GCS as a storage API and care about exactly which bytes you get back. For example, there's no way to see the length of the pre-gzip content, and you can only meaningfully compose two objects if either they are both gzipped or neither is gzipped.
The feature pretends to be general (you can set contentEncoding to any string you want), but the documentation only specifies what will happen for gzip. In Google bug 24347482 it was clarified to me that other encodings are simply ignored. But this is hardly confidence-inspiring: who's to say that GCS won't suddenly start supporting bzip2, changing the behavior of a whole class of requests? Even if that never happens, you may be behind an intermediate proxy that understands bzip2.
Because you can't see the pre-gzip length of the data, gcsfuse would have no choice but to surface the post-gzip data as the content of files, so that the file metadata matched the contents. Okay, that's fine, we would just read that data and return it to the user. Except the documentation doesn't make it clear that there is any reliable way to opt out of GCS's magic behavior around encodings.

If I set Accept-Encoding: gzip on my read requests, it appears to return the original content. But given the usual use of this header, I worry that some internal system will decode the content and some other system will later re-encode it, yielding different bytes. Worse, I worry that this will cause objects without any contentEncoding property set to be gzip-encoded before being sent to me, in the mistaken belief that I'm setting this header to save bandwidth rather than to opt out of the feature. The documentation does little to make me confident this won't happen.
More generally, even if GCS is scrupulous about fixing the point above, there may always be an intermediate HTTP proxy that decides to alter the content returned by GCS when it sees Accept-Encoding: gzip, especially for a read of an object that is not already encoded. Again, this feature appears to be intended only for the "user staring at content in a browser" case; otherwise the designers of the GCS API made a mistake by overloading Accept-Encoding and Content-Encoding for this feature.
The feature appears to cause GCS to ignore Range headers in requests in several cases (see Google bug 24347482), which means we can't efficiently read only a portion of a very large object.
This is touched on in points above, but it's worth restating: the documentation for this feature is extremely underspecified, making it stressful to even get started writing code against it. The Google bug ID for this is 24347482.
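
To illustrate the composition point from the list above, here is a minimal local sketch. It assumes that composing objects amounts to byte-level concatenation of their stored contents, and it uses Go's compress/gzip as the decoder; gzipBytes, gunzip, and compose are hypothetical helpers, not GCS API calls.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes returns the gzip encoding of p.
func gzipBytes(p []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(p); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// gunzip decodes p as a (possibly multi-member) gzip stream.
func gunzip(p []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(p))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// compose models object composition as plain concatenation of stored bytes.
// This is an assumption made for the sketch.
func compose(a, b []byte) []byte {
	out := make([]byte, 0, len(a)+len(b))
	out = append(out, a...)
	return append(out, b...)
}

func main() {
	plainA := []byte("hello, ")
	plainB := []byte("world of objects")

	// Both gzipped: the result is a multi-member gzip stream that decodes
	// to the concatenation of the two originals.
	both, err := gunzip(compose(gzipBytes(plainA), gzipBytes(plainB)))
	fmt.Printf("both gzipped: %q err=%v\n", both, err)

	// Only the first gzipped: decoding fails once the plain bytes start.
	_, err = gunzip(compose(gzipBytes(plainA), plainB))
	fmt.Printf("gzip + plain: err=%v\n", err)

	// Only the second gzipped: decoding fails at the very first header.
	_, err = gunzip(compose(plainA, gzipBytes(plainB)))
	fmt.Printf("plain + gzip: err=%v\n", err)

	// The stored length also differs from the pre-gzip length.
	fmt.Printf("pre-gzip=%d stored=%d\n", len(plainA), len(gzipBytes(plainA)))
}
```

Whatever single contentEncoding value the composed object ends up with, it can only describe the result correctly in the all-gzipped and none-gzipped cases.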

jacobsa commented 9 years ago

Here is a patch that starts to add contentEncoding-related tests to jacobsa/gcloud@ca4fb08, for posterity.

GoogleCloudPlatform / gcsfuse

Figure out story for object content encodings #131