GenSpectrum / LAPIS

An API, a query engine, and a database schema for genomic sequences; currently with a focus on SARS-CoV-2
https://lapis-three.vercel.app
GNU Affero General Public License v3.0
22 stars 6 forks source link

Fix compression headers #665

Closed chaoran-chen closed 8 months ago

chaoran-chen commented 8 months ago

Currently, when we go to aggregated?compression=gzip, we get a response with

Content-Encoding: gzip
Content-Type: application/json

Similarly, if choosing ?compression=zstd, one get a response with a Content-Encoding: zstd. Opening in a browser, this causes an error:

image

I think that the headers are actually quite wrong. If someone requests a compressed file, then the content is not JSON (or TSV or FASTA) but it's a GZIP or ZSTD file. I.e., there shouldn't be a Content-Encoding and the Content-Type should be application/gzip or application/zstd (I checked that both are official IANA media types) or maybe just application/octet-stream.

fengelniederhammer commented 8 months ago

Actually, after doing some research, I think that our implementation follows the specification/standards:

https://www.rfc-editor.org/rfc/rfc9110#name-content-encoding

Content-Encoding is primarily used to allow a representation's data to be compressed without losing the identity of its underlying media type.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

The Content-Type representation header is used to indicate the original media type of the resource (prior to any content encoding applied for sending).

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding

The Content-Encoding representation header lists any encodings that have been applied to the representation (message payload), and in what order. This lets the recipient know how to decode the representation in order to obtain the original payload format. Content encoding is mainly used to compress the message data without losing information about the origin media type.

The key point seems to be "without losing information about the origin media type".

Browsers AFAIK don't support zstd (also curl does not display the data because "binary data might mess with your terminal"). If we set the Content-Type to something else, then a browser probably still won't be able to display any meaningful data (I didn't try it though).

I'd argue that the current implementation is fine, so I'll move this to blocked until we decide that we still want to change it.

chaoran-chen commented 8 months ago

If a user does not set compression in the URL and has an Accept-Encoding header that includes e.g. gzip, then, by the HTTP specification, it means that the user can understand responses encoded with gzip. In other words, the user is willing to receive gzip-encoded data.

The logic for the compression query parameter is different. If the user sets compression to gzip, the resource/content that the user wants is a gzip file. The browser should never try to decode it because the resource/content that the user requested is the gzip and not an uncompressed file.

This is similar to our dataFormat query parameter vs the Accept HTTP header. If the dataFormat is TSV, then the user wants a TSV file and nothing else.

chaoran-chen commented 8 months ago

The current implementation doesn't work, does it? The user should be able to go to a particular link in their browser and the browser should then download a GZIP or ZSTD compressed file in the sense that it puts a compressed file onto their disk.

As the feature is very important and needed for our other project, I'll put it back into prioritized.

fengelniederhammer commented 8 months ago

So the feature is actually about downloading the file instead of showing it in the browser? This can already be achieved by setting compression=zstd&downloadAsFile=true.

Should we make downloadAsFile=true (or rather Content-Disposition: attachment in the response) implied by compression=zstd? But I'm not sure whether this is a good idea. I'm quite sure that it's a bad idea for gzip, as browsers usually demand gzip compressed data. There it should be up to the user to decide whether to download or not. Doing it only for zstd might be surprising. Also, downloading as a file and compression are two different dimensions. If users wants to have both, they should specify both.

We could also simply mention in the docs that usually you want to downloadAsFile=true, when asking for zstd compressed files (at least when downloading via the browser - using curl or wget, this isn't an issue anyway).

chaoran-chen commented 8 months ago

This can already be achieved by setting compression=zstd&downloadAsFile=true.

This doesn't work in (current and older) versions of Chrome:

Screenshot 2024-02-27 at 18 01 17

In future versions or if you enable zstd in Chrome's settings (https://caniuse.com/?search=zstd), it works.

But I'm not sure whether this is a good idea. I'm quite sure that it's a bad idea for gzip, as browsers usually demand gzip compressed data. There it should be up to the user to decide whether to download or not.

If a user asks for a gzipped TSV file (which is not the same as asking for a TSV file which may be encoded as a GZIP or not), then they should receive a GZIP file. Showing it in a decompressed format shall not be considered a right way to display the content. (A "correct" way could be to show the binary in a hex format but of course, no browser does that.)

I think that we should fully separate the query parameters from the Accept-Encoding header. A query defines what the content should be whereas Accept-Encoding is usually transparent to the user.

fengelniederhammer commented 8 months ago

I just tried https://lapis.cov-spectrum.org/gisaid/v2/sample/aggregated?accessKey=9C[...]Pd&downloadAsFile=true&compression=zstd. It triggers a download, both in Firefox (123.0) and Chrome (122.0.6261.94).

As we figured out, it works in Chrome, because the Linux build has the zstd flag enabled and tries to decompress zstd. If the flag is disabled, Chrome will show a failed request (even though it was successful - ignoring the Content-Disposition header). IMO this is a Chrome issue.

As dicsussed, we will change it to overwrite the Content-Type header, if the compression parameter is set.