Closed. konradpabjan closed this pull request 4 years ago.
I'm not sure that we always want to decode gzip responses automatically for the user. As an action author, I would expect the HTTP client to give me the raw data so that I can use it however I want.
I guess there are two use cases here: (1) the file itself was uploaded as a gzip stream, or (2) the client sent `accept-encoding: gzip` and the server decided to apply HTTP compression during transfer. In case 2, the HTTP client should handle decompression automatically, do you agree?
@konradpabjan, when we get the stream, is it a gzip of a gzip? Did we upload gzip? Did the server apply compression again during transfer? If we are use case 1, what is the header value of `accept-encoding`?
> @konradpabjan, when we get the stream, is it a gzip of a gzip? Did we upload gzip? Did the server apply compression again during transfer? If we are use case 1, what is the header value of `accept-encoding`?
There is no gzip of a gzip. When files are uploaded using `upload-artifact`, they are first gzipped locally, then the `Content-Encoding` header gets set to `gzip`, and that information gets saved in the DB for each individual uploaded file.
When making the `GET` request for an individual file during download, there is an `Accept-Encoding` header that you can set to `gzip` (I set it for all the requests; if you don't include it, the `Content-Encoding` response header is empty). When a response is received, the `Content-Encoding` header is either set to `gzip` or empty, depending on how the file was originally uploaded.
During artifact transfer, there is no extra gzip compression applied by the server. The only time the server does anything with gzip is when the zip file is being created for a download from the UI (previously gzipped files get ungzipped at that point).
> I'm not sure that we always want to decode gzip responses automatically for the user. As an action author, I would expect the HTTP client to give me the raw data so that I can use it however I want.
I think 99% of the time, if someone gets a response whose content is gzipped, they're going to want to un-gzip it immediately so they can start working with it (that's what `microsoft/typed-rest-client` does). In principle, I guess it would be nice to have the raw gzip data, but why? Valid use cases seem very limited, and almost everyone would immediately un-gzip the raw data. Gzip is mostly used to reduce the amount of data being transferred, so it's effectively a low-level implementation detail that our users won't need. I'm leaning towards always decoding gzip responses. Looking at issues in `microsoft/typed-rest-client`, there have been no complaints or requests for raw gzip data since the feature was added.
So this is our flow:

1. User starts with `fileA.txt`.
2. `Upload-Artifacts` compresses `fileA.txt` into a gzip stream and sends it up with `Content-Encoding: gzip`.
3. `Download-Artifacts` requests `fileA.txt` with `accept-encoding: gzip`.
4. The server sees the stored `fileA.txt`, figures out it's already gzip-compressed, so it just sends the stream back with `Content-Encoding: gzip` and does not reapply compression.
5. The HTTP client sees the `Content-Encoding` header and decompresses the stream back into `fileA.txt`.
6. `Download-Artifacts` saves the response to file.

So this flow is basically applying HTTP compression, and in this case we want to hide this layer from the user.
If a user wants to download a raw gzip file, they will have to upload a gzip with `Content-Encoding: identity`? And should then either request the file without `accept-encoding: gzip`, or let the server apply compression again?
This sounds right.
Well @hashtagchris is a genius, and it is possible to successfully un-gzip a downloaded file without making changes here.
Since this is no longer needed, I'm closing this PR 🎉
Overview
Largely porting over some changes from `microsoft/typed-rest-client`, which by default supports handling encoded responses: https://github.com/microsoft/typed-rest-client/pull/189

This is needed for the `v2` version of `download-artifact` that is currently in preview and has some problems related to gzip: https://github.com/actions/download-artifact/issues/23

When currently calling `readBody()`, the response body gets converted to a string with the default JavaScript string encoding, which is UTF-16. Once that happens, it is no longer possible to ungzip the content, because some information gets lost in this package during the initial string conversion. I've tried a host of different things, including piping the resulting string to a file and attempting to ungzip it, but there is always this `incorrect header check` error, which indicates the data is in an incorrect format. The really simple program that I've been using for testing.
I've also tried a host of different things using the Node `Buffer` class to go between encodings, but it seems like some information is permanently lost and cannot be retrieved after `readBody()` gets called: https://nodejs.org/docs/latest/api/buffer.html#buffer_buffers_and_character_encodings. I'm able to ungzip the body using the raw Fiddler response when an artifact download is happening, so the issue is definitely in this package.

Testing
- Added a test to make sure the content correctly gets decoded.
- Successful end-to-end artifact download/upload with the changes: https://github.com/konradpabjan/artifact-test/runs/551807430?check_suite_focus=true
- Downloading the artifacts from the UI shows the expected size, and inspecting the files reveals no gzip content, which is expected.
After running `npm build` with these changes, I copied the output over to the `node_modules` directories in the forked `@actions/download-artifact` and `@actions/upload-artifact` repos that I've been testing with.