duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.28k stars 130 forks source link

SELECT fails on pre-signed URLs due to CORS errors in DuckDB-Wasm #1852

Open coji opened 1 month ago

coji commented 1 month ago

What happens?

When executing a SELECT statement in DuckDB-Wasm on a data source accessed via a pre-signed URL (especially those created for GET requests), the operation fails due to CORS errors. This prevents querying data stored in locations that require pre-signed URLs for access.

To Reproduce

  1. Use the following pre-signed URL for a Parquet file (valid for 7 days from 2024-09-14):
    https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182
  2. In DuckDB-Wasm or at shell.duckdb.org, attempt to execute a SELECT statement on this data source using the pre-signed URL.

Example SQL query:

SELECT * FROM parquet_scan('https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182') LIMIT 10;
  1. Observe that the query fails due to a CORS error, and the data is not accessible.

Note: I tried this query on shell.duckdb.org, and it failed to access the data.

Additional context: The current behavior seems to be:

  1. DuckDB-Wasm attempts a HEAD request on the pre-signed URL.
  2. The HEAD request fails with a CORS error.
  3. An exception is thrown by xhr.send(null), which is not caught.
  4. The code for performing a range GET request is never reached.
  5. The SELECT statement fails, unable to access the data.

This behavior was observed both in a local DuckDB-Wasm implementation and on shell.duckdb.org.

Importantly, the bucket's CORS policy is set according to the documentation:

[
  {
    "AllowedOrigins": [
      "*"
    ],
    "AllowedMethods": [
      "GET",
      "HEAD"
    ],
    "AllowedHeaders": [
      "*"
    ],
    "ExposeHeaders": [
      "*"
    ],
    "MaxAgeSeconds": 3000
  }
]

Despite this CORS policy allowing both GET and HEAD methods from any origin, the issue persists. This suggests that the problem might be related to how DuckDB-Wasm handles the pre-signed URLs rather than the bucket's CORS configuration.

A possible solution might be to skip the HEAD request for pre-signed URLs or implement exception handling to proceed with the range GET request even if the HEAD request fails.

Browser/Environment:

Chrome 128.0.6613.138

Device:

M2 Macbook Air

DuckDB-Wasm Version:

1.28.1-dev278.0

DuckDB-Wasm Deployment:

shell.duckdb.org

Full Name:

Koji Mizoguchi

Affiliation:

TechTalk Inc.

carlopi commented 1 month ago

Using curl, in native:

% curl -I 'https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182'
HTTP/1.1 403 Forbidden
Date: Sat, 14 Sep 2024 11:09:08 GMT
Content-Type: text/plain;charset=UTF-8
Connection: keep-alive
Server: cloudflare
CF-RAY: 8c2fec512cac66ea-AMS

So I would interpret that as the HEAD request not being valid at all, it's not a CORS problem but a problem with presigned URLs that I think do allow only GET methods (I think, unsure, this one for sure).

I would need to check this in the shell, sounds possible an error should be dealt with better and recover from it

seanbirchall commented 1 month ago

I'm having the same issue with R2. I think it might be related to the presigned URL only being valid for whatever method you've specified to access the data. In duckdb wasm when we read or attach to a file in most cases a HEAD and then a GET request are sent. There's a few other issues talking about this https://github.com/duckdb/duckdb-wasm/issues/699. Curious if there's any solution?

Sounds like a possible work around is to create a modified version which sends a ranged GET request first instead of the HEAD request. At which point only GET requests would be sent.

tobilg commented 1 month ago

Maybe check the following: https://stackoverflow.com/a/76836943

Also, can you access the presigned URL via curl w/o Problem?

seanbirchall commented 1 month ago

@tobilg can't curl mine either... sorry if URL expires.

curl -I "https://01bec96ddf135b4f6636692059641ffe.r2.cloudflarestorage.com/scrapeable-data/db_test.duckdb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=f7f92bbce532f6ead44c019c43921a18%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T184231Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=486f72d518388b248d5d9955743ab7a5eb2023c5ef9723c0d5c6afa132b923ac"
HTTP/1.1 403 Forbidden
Date: Sat, 14 Sep 2024 18:45:21 GMT
Content-Type: text/plain;charset=UTF-8
Connection: keep-alive
Server: cloudflare
CF-RAY: 8c3288989eee8fb8-BOS

Maybe I need to get rid of special characters in my bucket and .duckdb in my object.

seanbirchall commented 1 month ago

It looks like this PR should've fixed the issue https://github.com/duckdb/duckdb-wasm/issues/1367 to https://github.com/duckdb/duckdb-wasm/pull/1717

tobilg commented 1 month ago

Have you checked adding the Content-Range header, as outline https://github.com/duckdb/duckdb-wasm/pull/1717#issuecomment-2090268399

coji commented 1 month ago

Have you checked adding the Content-Range header, as outline #1717 (comment)

Thank you for your confirmation. I added the following setting to CORS:

   "ExposeHeaders": ["*"]

but it still results in an error at the time of the HEAD request. I've also updated the issue itself with the above information.

coji commented 1 month ago

this is huge!

https://github.com/duckdb/duckdb-wasm/pull/1856

seanbirchall commented 1 month ago

@coji I tried out this example app from @e1arikawa using your above example but still cannot get it to work. So maybe it is a CORS issue, or this app is incomplete for testing this, or the PR doesn't work for our desired use case?

Either way maybe give it a try too and see if you can query your S3 parquet file. I still see

duckdb-browser-eh.worker.js:1 Access to XMLHttpRequest at 'https://91ff95bcb91fbfa1b1c5c356262b1fe4.r2.cloudflarestorage.com/techtalk/world_populations.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0d9126cf0fed3ae3c00f20ceb2bb97c3%2F20240914%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240914T091120Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=1bddf8fcc77e83aa20ffa827e771cea7310af373354af06c5ac58f2e181f0182' from origin 'http://127.0.0.1:5500' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.