Closed carlopi closed 2 months ago
Thanks for looking this over! I just did some testing, and this new logic does not work with my data. Here are the details:
Here is the query I am testing with (pre-signed url is valid for ~11h):
SELECT * FROM 'https://public-duckdb-range-request-test.s3.us-east-1.amazonaws.com/duckdb_test_data.parquet?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEcaCXVzLWVhc3QtMSJHMEUCIE%2F0aTAuIR%2BLhRcj0YO5sMYFwVg8dptLzR0f7Ep2S%2BlKAiEA33lVQFExAz6JvhM30uHST4YJcxSdXi%2Fgs2KzZpWiN7Yq8QIIkP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMzk1MjA0NDA4MDIiDJzkOn7RcDiOHuN6mirFAtL7aB9Q3xpGw4BpolagK6tIImUHMnrblV1ACeg8eNbzQL5IYe3zZ4EiUJ6jG0A3ExTtOEE4V6YUKFoqsdHhu9%2FUaLs9Bnmbd9Dle1HgdYscK7PGayfHnr1shxEbAgzRVbt1zNkpWQVtF1W4GVJziPANfO6DLcdAHQhpxfvHwflHMYoXQU4k0rigFors2o13CHD%2B0xDfUfVuGpS%2BkwChJQpSZBv8cWBWQ8zgnbwtd6a8ylkAqFWWW8NqFsn9sgyWjgaJPxMpFuLo5HzanbA41mOoqvS39x6W4O4NfMh9x3NNFs99aVUir8lnNTzzbyxXa0gXkUk2MDaR797yZaxhew8RCe0q6vMRZeDFPglOIYbPimZ6deRd%2BVhrKjYnVXGuOAMYp0H5aFbZnaV2gyFq4n2jJbFVvEJagMJyPhxaZlQ6u6YE1K0w%2Br2ksQY6swJnmbvWBWJrn7fQ6FsZQ05Sy5g7vWBzRvAdfJfLc3REl682ZnGip8plRDFOMtkLXIewEshw6w7o%2FP3KXqCJGAYQej7mSpPdjmhOy%2BxCrtwJmKOGgKY8jYVMkmnxjiIXeiTJaPwZpCBbOZlGIn6wwgNpIQseIKwniznuAE7AHLaNDonR%2BT0cq5MoU3tyVR0mT9NUOCOljmIB39vgbQmYfgJ9nqWVWqoeOC2Ky4A2FoMEndJWJAD1ovDRBpDwYC%2BjLCm9MZAG%2FfEH%2B2luDcDbOnXivlU8Z0vg3IKXaaUFvSq%2FgqU4A%2FGD037QN%2FnIEAkr7sn5wQwoOp5R1XNuX3rRJkC0xhAuk68mqAu9mihlxYNuGbun0bl97CAcmpOgZY%2Fhll0iVLQw5RLMD4DL3rVsoiQq7XQ3&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240424T150559Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAU6DH6ZHRDO7M4AOP%2F20240424%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=9ddb6e8a5ea5d9845794630454ccdcf7207357fc283bc121e145449696696384' where postcode = 10787;
null
as the HEAD request returns a 403, meaning the second block does not get triggeredfile.dataProtocol
is HTTP as I do not specify otherwise (this also prevents the second block from being triggered)Thanks for the feedback, and providing the means for testing in a easier way.
I am trying to check whether removing the line:
- if (((contentLength !== null) && (+contentLength > 1)) || file.dataProtocol == DuckDBDataProtocol.S3) {
+ {
I will do some testing on your testcase.
Yes I think removing that conditional logic could work, although I really don't know much about the S3 protocol and how it is applies here.
I think this issue could be generalized as 'detect support for range requests when HEAD isn't allowed'
@carstonhernke: I changed the code so that now actually compiles, and I think now it should do as intended, can you possibly double check?
The main problem here is that we might be regressing in some cases (say due to punitive performance of downloading the whole file, that might happen by mistake here), but for those you will have to turn off allowFullHttpReads.
Yes, just tested and it behaves as expected! (uses range requests)
@carstonhernke: actually there is a problem: Content-Range is an unsafe header, and as such it will not be issued in Web embeddings of duckdb-wasm.
Possibly the logic might still make sense for node, but at the very least there should be added a setting, defaulting to false, on whether to use Content-Range or not.
To test this, try to visit https://shell.duckdb.org/?bundle=eh/#queries=v0,LOAD-spatial~%0A%0ACREATE-TABLE-nyc-AS%0A----SELECT%0A--------borough%2C%0A--------st_union_agg(geom)-AS-full_geom%2C%0A--------st_area(full_geom)-AS-area%2C%0A--------st_centroid(full_geom)-AS-centroid%2C%0A--------count(*)-AS-count-FROM%0Ast_read('https%3A%2F%2Fraw.githubusercontent.com%2Fduckdb%2Fduckdb_spatial%2Fmain%2Ftest%2Fdata%2Fnyc_taxi%2Ftaxi_zones%2Ftaxi_zones.shp')%0AGROUP-BY-borough~%0ASELECT-borough%2C-area%2C-centroid%3A%3AVARCHAR%2C-count%0AFROM-nyc~ and check the errors in the developer console.
Argh, this seems to be true on Chrome, I guess since it's not CORS safelisted and there the CORS headers are enforced. I might need to do more checks
I think this depends on the CORS configuration/ headers of the remote file which is being requested. Because Content-Range
is not a safelisted header, it won't be accessible to JS code by default in browsers. However, it can be added to the CORS config.
"ExposeHeaders": [ "Access-Control-Allow-Origin", "Content-Range" ]
If the header is not available, but the server only returns a single byte, then I suppose we need another request to get the whole file. So in a case where this header is not available 90% of the time, I think you're right that it should be an optional setting to avoid adding unnecessary overhead.
Mostly by @carstonhernke, minor refactor to unify with follow-up logic.
Fixes https://github.com/duckdb/duckdb-wasm/issues/1367, and closes https://github.com/duckdb/duckdb-wasm/pull/1702 since it implements the same functionality.