DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

File extension being truncated when matrix files are downloaded (HCA) #5770

Open bvizzier-ucsc opened 11 months ago

bvizzier-ucsc commented 11 months ago

Slack thread

Description of the problem as reported by the user:

Sometimes, when I download matrix files that have gz compression, I get the file without the gz extension, although it is downloaded compressed. For example, the files here: https://explore.data.humancellatlas.org/projects/bfaedc29-fe84-4e72-a461-75dc8aabbd1b/project-matrices. In other cases, I get the gz extension but not the file extension, like RDS. For example, Adrenal_gene_count.RDS from here: https://explore.data.humancellatlas.org/projects/a9301beb-e9fa-42fe-b75c-84e8a460c733/project-matrices is downloaded as Adrenal_gene_count.gz, and Thymus_gene_count.RDS as Thymus_gene_count.gz. This was occurring in DCP 2, too. I am using Google Chrome v119 on macOS 14.1.1. Thank you!

Dave Rogers investigated the problem and reported:

This looks like it may be an Azul bug to us @Ben Vizzier. We see the request being made with the correct file name, and we use the full, correct file name in the HTML5 download attribute. I do see that Azul is not setting a Content-Disposition header like Content-Disposition: attachment; filename="filename.jpg" in the response. I also see that the Content-Type does not reflect the gz status, so this may encourage the browser to drop the .gz extension. Or so says the internet. In any case, I cannot see anything the front end is doing incorrectly.

achave11-ucsc commented 11 months ago

Assignee to consider next steps.

hannes-ucsc commented 11 months ago

I do see that Azul is not setting a Content-Disposition header like Content-Disposition: attachment; filename="filename.jpg" in the response. I also see that the Content-Type does not reflect the gz status, so this may encourage the browser to drop the .gz extension. Or so says the internet.

The file is served not by Azul, but by Google Cloud Storage.

image

The screenshot shows that there are two requests at play here. The first request is to Azul. While serving that request, Azul acts as a client to TDR's DRS implementation in order to obtain a signed URL to the file. Azul then returns that signed URL verbatim to the Data Browser (DB). DB then makes a second request, this time to that signed URL. That request goes to Google, not Azul. The signed URL points to a file in a GCS bucket owned and controlled by TDR.

Here are the response headers for the second request:

image

Because Google responds without a content-disposition header, without a content-encoding header and with a content-type header that falsely declares the file as CSV while the response body is actually still gzip-encoded, the user ends up with a gzip-compressed file, but without the .gz extension in the name.

By convention, a file compressed with gzip should have the .gz extension. Alternatively, the file could be decompressed on the fly during the download and stored without the .gz extension in the name. I've implemented both solutions in the past.
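As a rough illustration of those two conventions, here is a minimal Python sketch of what a user or client could do after the download. The helper names (`is_gzip`, `with_gz_suffix`, `decompressed`) are hypothetical and not part of Azul or TDR. The sketch assumes the whole file fits in memory and detects gzip content by its two-byte magic number.

```python
import gzip
from typing import Tuple

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(data: bytes) -> bool:
    """Return True if the content starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def with_gz_suffix(name: str, data: bytes) -> str:
    """Option 1: keep the content compressed and make sure the name says so."""
    if is_gzip(data) and not name.lower().endswith(".gz"):
        return name + ".gz"
    return name


def decompressed(name: str, data: bytes) -> Tuple[str, bytes]:
    """Option 2: decompress the content and drop any .gz suffix from the name."""
    if not is_gzip(data):
        return name, data
    if name.lower().endswith(".gz"):
        name = name[:-3]
    return name, gzip.decompress(data)
```

Either option leaves the file name and the file content in agreement, which is the property the current downloads lack.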

It is true that certain valid combinations of the content-type, content-disposition and content-encoding response headers might solve this in a way consistent with the two common scenarios mentioned above (EITHER compressed with .gz extension OR uncompressed without that extension). However, Azul has no way of affecting what headers Google returns when DB makes a request to Google. TDR may be able to bake certain headers into the signed URL but, again, Azul just returns the signed URL that it receives from TDR.
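To make the two header combinations concrete, here is a small Python sketch. `download_headers` is a hypothetical helper, not code from Azul or TDR. It assumes the stored object is gzip-compressed, that `base_name` is the file name without the .gz suffix, and that the name is plain ASCII with no quote characters.

```python
from typing import Dict


def download_headers(base_name: str,
                     media_type: str,
                     decompress_on_the_fly: bool) -> Dict[str, str]:
    """
    Build response headers for a gzip-compressed object.

    base_name:  name without the .gz suffix, e.g. "matrix.csv" (assumed ASCII)
    media_type: type of the uncompressed content, e.g. "text/csv"
    """
    if decompress_on_the_fly:
        # The client decodes the body while downloading, so the saved file
        # is uncompressed and the name carries no .gz suffix.
        return {
            "Content-Type": media_type,
            "Content-Encoding": "gzip",
            "Content-Disposition": f'attachment; filename="{base_name}"',
        }
    # The body is delivered as an opaque gzip file, so the saved file stays
    # compressed and the name carries the .gz suffix.
    return {
        "Content-Type": "application/gzip",
        "Content-Disposition": f'attachment; filename="{base_name}.gz"',
    }
```

The response described above matches neither branch: it declares the uncompressed media type but omits both content-encoding and content-disposition.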

I've raised this before with the Broad but got nowhere: https://github.com/DataBiosphere/azul/issues/4838

hannes-ucsc commented 11 months ago

I'm afraid there is nothing the Azul team can do.

achave11-ucsc commented 11 months ago

Assignee to try to raise this again with the Broad.

bvizzier-ucsc commented 9 months ago

Still under investigation on the Broad side.