duckdb / duckdb_azure

Azure extension for DuckDB
MIT License
50 stars 17 forks source link

Querying data from public $web container without authentication #46

Closed dlksmc closed 8 months ago

dlksmc commented 8 months ago

Thanks for the great extension! I am using Azure Storage to host a static website. I have data stored within the public $web container (where the html is hosted) that I can successfully query in DuckDB using a connection_string containing an AccountKey or SharedAccessSignature (as described in the extensions docs):

CREATE SECRET secret1 (
    TYPE AZURE,
    CONNECTION_STRING '⟨value⟩'
);

I would like to query the data without authentication though. I've tried removing the AccountKey or SharedAccessSignature portion of the secret config above, and tried the config suggested in the docs for authentication-less access:

CREATE SECRET secret2 (
    TYPE AZURE,
    PROVIDER CONFIG,
    ACCOUNT_NAME '⟨storage account name⟩'
);

Unfortunately, both attempts result in an error: NoAuthenticationInformationReason Phrase: Server failed to authenticate the request. Please refer to the information in the www-authenticate header..

I've also tried using https:// instead of az://. Without establishing any secret, from https://{storage_account}.web.core.windows.net/my_table.parquet/partition_1=1/partition_2=hello/part-0.parquet' works. But from 'https://{storage_account}.web.core.windows.net/my_table.parquet/**/*.parquet' gives a 404 error (the requested content does not exist). In other words, I'm unable to figure out how to read the full Parquet with Hive-style partitions when using https://.

Am I doing something wrong, or is this not supported?

quentingodeau commented 8 months ago

Hi @dlksmc, can you provide the public url that I can try to take a look ?

dlksmc commented 8 months ago

Hi @quentingodeau. Here's a public example:

create or replace secret secret_mtcars (
  type azure,
  connection_string 'DefaultEndpointsProtocol=https;AccountName=reztest;SharedAccessSignature=sp=rl&st=2024-03-01T03:51:50Z&se=2024-03-01T11:51:50Z&spr=https&sv=2022-11-02&sr=c&sig={sas};EndpointSuffix=core.windows.net'
);

-- Works (with {sas})
select *
from 'az://$web/mtcars.parquet/**/*.parquet'

-- Works
select *
from 'https://reztest.z13.web.core.windows.net/mtcars.parquet/am=0/gear=3/part-0.parquet'

-- Doesn't work
select *
from 'https://reztest.z13.web.core.windows.net/mtcars.parquet/**/*.parquet'
samansmink commented 8 months ago

hi @dlksmc

one of the problems here is that you are trying to do a glob on an http url, this can not work since duckdb will not know how to do file listing over raw http. You need to use az:// urls for that.

I'm able to do:

CREATE SECRET secret2 (
·     TYPE AZURE,
·     PROVIDER CONFIG,
·     ACCOUNT_NAME '<acc>'
‣ );

then query a public file:

FROM 'az://bucket/file.csv'

alternatively, using fully qualified urls we can now do:

from 'az://<account>.blob.core.windows.net/<container>/file.csv';

However, globbing without authentication is not possible atm, and i'm not sure thats even possible:

quentingodeau commented 8 months ago

Yes after some additional check I think you will not be able to do that but you can still generates a SAS key with read permission, like this it does as if you have public data and you can control the permission

dlksmc commented 8 months ago

Thanks for your responses, @quentingodeau and @samansmink. It turns out I needed to change the access level for the container to allow anonymous read access:

image

I thought enabling the $web container already did that, but apparently not. After updating the access, the secret2 version where you just provide the account_name works!