duckdb / duckdb_delta

DuckDB extension for Delta Lake
MIT License
136 stars 14 forks source link

duckdb 1.1 delta reader ignore azure secret #83

Closed djouallah closed 1 month ago

djouallah commented 1 month ago

azure secret works fine when using parquet but are not passed when using delta_scan

this works

token =notebookutils.credentials.getToken('storage')
storage_options={"bearer_token": token, "use_fabric_endpoint": "true"}
import duckdb
duckdb.sql(f"""
INSTALL azure ;
LOAD azure;
CREATE or replace SECRET secret4 (
    TYPE AZURE,
    PROVIDER ACCESS_TOKEN,
    ACCESS_TOKEN '{token}' ,
    ACCOUNT_NAME 'onelake'
);
""")
duckdb.sql(" select * from parquet_scan('abfss://sqlengines@onelake.dfs.fabric.microsoft.com/ETL.Lakehouse/Tables/dbo/results/*.parquet') ")

this does not works ( although it give wrong results as it is not using the delta log to scan only relevant parquet files)

token =notebookutils.credentials.getToken('storage')
storage_options={"bearer_token": token, "use_fabric_endpoint": "true"}
import duckdb
duckdb.sql(f"""
INSTALL azure ;
LOAD azure;
CREATE or replace SECRET secret4 (
    TYPE AZURE,
    PROVIDER ACCESS_TOKEN,
    ACCESS_TOKEN '{token}' ,
    ACCOUNT_NAME 'onelake'
);
""")
duckdb.sql(" select * from delta_scan('abfss://cccccc@onelake.dfs.fabric.microsoft.com/ETL.Lakehouse/Tables/dbo/results') ")
IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'abfss://cccc@onelake.dfs.fabric.microsoft.com/ETL.Lakehouse/Tables/dbo/results/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: Client error with status 401 Unauthorized: <?xml version="1.0" encoding="utf-8"?><Error><Code>NoAuthenticationInformation</Code><Message>Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:f47557aa-c01e-0091-2a10-03480f000000
Time:2024-09-09T23:33:20.7177725Z</Message></Error>)
gdubya commented 1 month ago

It works for me using the credential chain / CLI. Maybe it's something with the token auth?

EDIT: I'm also testing against a normal ADLS account, not Fabric. I'll test against Fabric later this evening.

Tikavdm commented 1 month ago

I agree with @djouallah I cannot get it work with with delta_scan, but seems like parquet_scan works:

Simple example of trying to use the new token-based auth with Azure extension. This works exactly like this with a CONNECTION STRING, but using the new token-based auth I get an error, which I'm not sure how to figure out if it is a security policy on my side or something not working with the latest extension. I'm using Duckdb 1.1.0

credential = AzureMLOnBehalfOfCredential()
os.environ['AZURE_ACCESS_TOKEN'] = credential.get_token("https://storage.azure.com/").token

conn = duckdb.connect(':memory:')
conn.install_extension("delta")
conn.load_extension("delta")
conn.install_extension("azure")
conn.load_extension("azure")
conn.execute("""SET azure_transport_option_type =  curl""")
conn.execute(f"""CREATE SECRET secret1 (
    TYPE AZURE,
    PROVIDER ACCESS_TOKEN,
    ACCESS_TOKEN '{os.getenv('AZURE_ACCESS_TOKEN')}',
    ACCOUNT_NAME '{os.getenv('AZURE_STORAGE_ACCOUNT')}'
);""")

full_path=f'abfss://{CONTAINER_NAME}/{delta_table_path}'
query = f"SELECT * FROM delta_scan('{full_path}') LIMIT 3"
conn.execute(query).fetchdf()

Error I get:

Traceback (most recent call last) Cell In [43], line 37 35 full_path=f'abfss://{CONTAINER_NAME}/{delta_table_path}' 36 query = f"SELECT * FROM delta_scan('{full_path}') LIMIT 3" ---> 37 conn.execute(query).fetchdf() IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'abfss://xxx'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: Client error with status 401 Unauthorized: <?xml version="1.0" encoding="utf-8"?><Error><Code>NoAuthenticationInformation</Code><Message>Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

Tikavdm commented 1 month ago

@djouallah, I'm also just checking: Does using the parquet_scan still provide the correct output, even though it is a delta_table? This can then at least be a workaround (I know you will probably lose some delta table functionality, but as long as it reads the latest version of the data, it is potentially a workaround for now).

djouallah commented 1 month ago

@Tikavdm no, you should never ever use parquet scan to read a delta table, you will get wrong results if the table had delete or compact, optimize operation, I used that example only to show the issue with authentication

gdubya commented 1 month ago

Confirmed i get the same problem using an ACCESS_TOKEN against my storage account, so it sounds like a problem with that feature.

gdubya commented 1 month ago

Oh, that's right. This code needs to be updated to handle access tokens since that is a new feature for the azure extension.

https://github.com/duckdb/duckdb_azure/pull/64

djouallah commented 1 month ago

still same issue, using deltalake extension 0.2

IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'abfss://sqlengines@onelake.dfs.fabric.microsoft.com/ETL.Lakehouse/Tables/T2/chdb/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: The operation lacked valid authentication credentials for path ETL.Lakehouse/Tables/T2/chdb/_delta_log/_last_checkpoint: Client error with status 401 Unauthorized: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidAuthenticationInfo</Code><Message>Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:d2946d3c-601e-001f-7194-0407ae000000
Time:2024-09-11T21:48:26.8575298Z</Message><AuthenticationErrorDetail>Issuer validation failed. Issuer did not match.</AuthenticationErrorDetail></Error>)
gdubya commented 1 month ago

@djouallah have you tried the nightly?

djouallah commented 1 month ago

@gdubya no but i am using the latest delta extension which 0.2, it should have the fix right

gdubya commented 1 month ago

Ah, yes, you're right (according to https://github.com/duckdb/duckdb_delta/releases/tag/v0.2.0). Weird. I'll test it again.

hirsimaki-markus commented 1 month ago

Is this a related issue #91? And if there is some workaround that applies to #91 it would be great

gdubya commented 1 month ago

@hirsimaki-markus no, that's a different issue. This one is specifically related to access tokens, but you're using a service principal.

Tikavdm commented 1 month ago

For the record mine works now!! Thanks @gdubya . So token-based authentication works with the latest duckdb 1.1.0 and the updated extension of delta.

gdubya commented 1 month ago

@djouallah It looks like is now a Fabric issue, so you can reopen #43

I tested by copying a Delta table from Fabric to a regular ADLS account. ADLS works, Fabric does not.