Closed chaoran-chen closed 8 months ago
Actually, after doing some research, I think that our implementation follows the specification/standards:
https://www.rfc-editor.org/rfc/rfc9110#name-content-encoding
Content-Encoding is primarily used to allow a representation's data to be compressed without losing the identity of its underlying media type.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
The Content-Type representation header is used to indicate the original media type of the resource (prior to any content encoding applied for sending).
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding
The Content-Encoding representation header lists any encodings that have been applied to the representation (message payload), and in what order. This lets the recipient know how to decode the representation in order to obtain the original payload format. Content encoding is mainly used to compress the message data without losing information about the origin media type.
The key point seems to be "without losing information about the origin media type".
Browsers AFAIK don't support zstd (also curl does not display the data because "binary data might mess with your terminal"). If we set the Content-Type
to something else, then a browser probably still won't be able to display any meaningful data (I didn't try it though).
I'd argue that the current implementation is fine, so I'll move this to blocked until we decide that we still want to change it.
If a user does not set compression
in the URL and has an Accept-Encoding
header that includes e.g. gzip, then, by the HTTP specification, it means that the user can understand responses encoded with gzip. In other words, the user is willing to receive gzip-encoded data.
The logic for the compression
query parameter is different. If the user sets compression
to gzip, the resource/content that the user wants is a gzip file. The browser should never try to decode it because the resource/content that the user requested is the gzip and not an uncompressed file.
This is similar to our dataFormat
query parameter vs the Accept
HTTP header. If the dataFormat
is TSV, then the user wants a TSV file and nothing else.
The current implementation doesn't work, does it? The user should be able to go to a particular link in their browser and the browser should then download a GZIP or ZSTD compressed file in the sense that it puts a compressed file onto their disk.
As the feature is very important and needed for our other project, I'll put it back into prioritized.
So the feature is actually about downloading the file instead of showing it in the browser? This can already be achieved by setting compression=zstd&downloadAsFile=true
.
Should we make downloadAsFile=true
(or rather Content-Disposition: attachment
in the response) implied by compression=zstd
?
But I'm not sure whether this is a good idea. I'm quite sure that it's a bad idea for gzip, as browsers usually demand gzip compressed data. There it should be up to the user to decide whether to download or not. Doing it only for zstd might be surprising. Also, downloading as a file and compression are two different dimensions. If users wants to have both, they should specify both.
We could also simply mention in the docs that usually you want to downloadAsFile=true
, when asking for zstd compressed files (at least when downloading via the browser - using curl or wget, this isn't an issue anyway).
This can already be achieved by setting compression=zstd&downloadAsFile=true.
This doesn't work in (current and older) versions of Chrome:
In future versions or if you enable zstd in Chrome's settings (https://caniuse.com/?search=zstd), it works.
But I'm not sure whether this is a good idea. I'm quite sure that it's a bad idea for gzip, as browsers usually demand gzip compressed data. There it should be up to the user to decide whether to download or not.
If a user asks for a gzipped TSV file (which is not the same as asking for a TSV file which may be encoded as a GZIP or not), then they should receive a GZIP file. Showing it in a decompressed format shall not be considered a right way to display the content. (A "correct" way could be to show the binary in a hex format but of course, no browser does that.)
I think that we should fully separate the query parameters from the Accept-Encoding
header. A query defines what the content should be whereas Accept-Encoding
is usually transparent to the user.
I just tried https://lapis.cov-spectrum.org/gisaid/v2/sample/aggregated?accessKey=9C[...]Pd&downloadAsFile=true&compression=zstd. It triggers a download, both in Firefox (123.0) and Chrome (122.0.6261.94).
As we figured out, it works in Chrome, because the Linux build has the zstd flag enabled and tries to decompress zstd. If the flag is disabled, Chrome will show a failed request (even though it was successful - ignoring the Content-Disposition
header). IMO this is a Chrome issue.
As dicsussed, we will change it to overwrite the Content-Type
header, if the compression
parameter is set.
Currently, when we go to
aggregated?compression=gzip
, we get a response withSimilarly, if choosing
?compression=zstd
, one get a response with aContent-Encoding: zstd
. Opening in a browser, this causes an error:I think that the headers are actually quite wrong. If someone requests a compressed file, then the content is not JSON (or TSV or FASTA) but it's a GZIP or ZSTD file. I.e., there shouldn't be a
Content-Encoding
and theContent-Type
should beapplication/gzip
orapplication/zstd
(I checked that both are official IANA media types) or maybe justapplication/octet-stream
.