duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.02k stars 110 forks source link

adding logic to detect whether range requests are supported via ContentRange #1717

Closed carlopi closed 2 months ago

carlopi commented 2 months ago

Mostly by @carstonhernke, minor refactor to unify with follow-up logic.

Fixes https://github.com/duckdb/duckdb-wasm/issues/1367, and closes https://github.com/duckdb/duckdb-wasm/pull/1702 since it implements the same functionality.

carstonhernke commented 2 months ago

Thanks for looking this over! I just did some testing, and this new logic does not work with my data. Here are the details:

Here is the query I am testing with (pre-signed url is valid for ~11h):

SELECT * FROM 'https://public-duckdb-range-request-test.s3.us-east-1.amazonaws.com/duckdb_test_data.parquet?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEcaCXVzLWVhc3QtMSJHMEUCIE%2F0aTAuIR%2BLhRcj0YO5sMYFwVg8dptLzR0f7Ep2S%2BlKAiEA33lVQFExAz6JvhM30uHST4YJcxSdXi%2Fgs2KzZpWiN7Yq8QIIkP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMzk1MjA0NDA4MDIiDJzkOn7RcDiOHuN6mirFAtL7aB9Q3xpGw4BpolagK6tIImUHMnrblV1ACeg8eNbzQL5IYe3zZ4EiUJ6jG0A3ExTtOEE4V6YUKFoqsdHhu9%2FUaLs9Bnmbd9Dle1HgdYscK7PGayfHnr1shxEbAgzRVbt1zNkpWQVtF1W4GVJziPANfO6DLcdAHQhpxfvHwflHMYoXQU4k0rigFors2o13CHD%2B0xDfUfVuGpS%2BkwChJQpSZBv8cWBWQ8zgnbwtd6a8ylkAqFWWW8NqFsn9sgyWjgaJPxMpFuLo5HzanbA41mOoqvS39x6W4O4NfMh9x3NNFs99aVUir8lnNTzzbyxXa0gXkUk2MDaR797yZaxhew8RCe0q6vMRZeDFPglOIYbPimZ6deRd%2BVhrKjYnVXGuOAMYp0H5aFbZnaV2gyFq4n2jJbFVvEJagMJyPhxaZlQ6u6YE1K0w%2Br2ksQY6swJnmbvWBWJrn7fQ6FsZQ05Sy5g7vWBzRvAdfJfLc3REl682ZnGip8plRDFOMtkLXIewEshw6w7o%2FP3KXqCJGAYQej7mSpPdjmhOy%2BxCrtwJmKOGgKY8jYVMkmnxjiIXeiTJaPwZpCBbOZlGIn6wwgNpIQseIKwniznuAE7AHLaNDonR%2BT0cq5MoU3tyVR0mT9NUOCOljmIB39vgbQmYfgJ9nqWVWqoeOC2Ky4A2FoMEndJWJAD1ovDRBpDwYC%2BjLCm9MZAG%2FfEH%2B2luDcDbOnXivlU8Z0vg3IKXaaUFvSq%2FgqU4A%2FGD037QN%2FnIEAkr7sn5wQwoOp5R1XNuX3rRJkC0xhAuk68mqAu9mihlxYNuGbun0bl97CAcmpOgZY%2Fhll0iVLQw5RLMD4DL3rVsoiQq7XQ3&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240424T150559Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAU6DH6ZHRDO7M4AOP%2F20240424%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=9ddb6e8a5ea5d9845794630454ccdcf7207357fc283bc121e145449696696384' where postcode = 10787;
carlopi commented 2 months ago

Thanks for the feedback, and providing the means for testing in a easier way.

I am trying to check whether removing the line:

-                 if (((contentLength !== null) && (+contentLength > 1)) || file.dataProtocol == DuckDBDataProtocol.S3) {
+                {

I will do some testing on your testcase.

carstonhernke commented 2 months ago

Yes I think removing that conditional logic could work, although I really don't know much about the S3 protocol and how it is applies here.

I think this issue could be generalized as 'detect support for range requests when HEAD isn't allowed'

carlopi commented 2 months ago

@carstonhernke: I changed the code so that now actually compiles, and I think now it should do as intended, can you possibly double check?

The main problem here is that we might be regressing in some cases (say due to punitive performance of downloading the whole file, that might happen by mistake here), but for those you will have to turn off allowFullHttpReads.

carstonhernke commented 2 months ago

Yes, just tested and it behaves as expected! (uses range requests)

carlopi commented 2 months ago

@carstonhernke: actually there is a problem: Content-Range is an unsafe header, and as such it will not be issued in Web embeddings of duckdb-wasm.

Possibly the logic might still make sense for node, but at the very least there should be added a setting, defaulting to false, on whether to use Content-Range or not.

carlopi commented 2 months ago

To test this, try to visit https://shell.duckdb.org/?bundle=eh/#queries=v0,LOAD-spatial~%0A%0ACREATE-TABLE-nyc-AS%0A----SELECT%0A--------borough%2C%0A--------st_union_agg(geom)-AS-full_geom%2C%0A--------st_area(full_geom)-AS-area%2C%0A--------st_centroid(full_geom)-AS-centroid%2C%0A--------count(*)-AS-count-FROM%0Ast_read('https%3A%2F%2Fraw.githubusercontent.com%2Fduckdb%2Fduckdb_spatial%2Fmain%2Ftest%2Fdata%2Fnyc_taxi%2Ftaxi_zones%2Ftaxi_zones.shp')%0AGROUP-BY-borough~%0ASELECT-borough%2C-area%2C-centroid%3A%3AVARCHAR%2C-count%0AFROM-nyc~ and check the errors in the developer console.

Argh, this seems to be true on Chrome, I guess since it's not CORS safelisted and there the CORS headers are enforced. I might need to do more checks

carstonhernke commented 2 months ago

I think this depends on the CORS configuration/ headers of the remote file which is being requested. Because Content-Range is not a safelisted header, it won't be accessible to JS code by default in browsers. However, it can be added to the CORS config. "ExposeHeaders": [ "Access-Control-Allow-Origin", "Content-Range" ]

If the header is not available, but the server only returns a single byte, then I suppose we need another request to get the whole file. So in a case where this header is not available 90% of the time, I think you're right that it should be an optional setting to avoid adding unnecessary overhead.