duckdb / duckdb_delta

DuckDB extension for Delta Lake
MIT License

Add support for Delta tables stored in ADLS Gen2 #21

Open jegranado opened 1 month ago

jegranado commented 1 month ago

select * FROM delta_scan('abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/');

results in: IOException: IO Error: Hit DeltaKernel FFI error (from: get_default_client in DeltaScanScanBind): Hit error: 5 (GenericError) with message (Generic delta kernel error: Error interacting with object store: Generic parse_url error: feature for MicrosoftAzure not enabled)

mrjsj commented 1 month ago

Did you install and load the duckdb Azure extension?
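For reference, a typical setup looks roughly like this — a sketch only, with placeholder account, path, and connection-string values, assuming a connection-string secret (other secret providers exist):

```sql
-- Install and load the Azure extension
INSTALL azure;
LOAD azure;

-- Register credentials via DuckDB's secrets manager
-- (placeholder connection string)
CREATE SECRET my_azure_secret (
    TYPE AZURE,
    CONNECTION_STRING '<your-connection-string>'
);

-- Plain file reads over abfss:// then work...
SELECT * FROM read_csv('abfss://<account>.dfs.core.windows.net/<path>/file.csv');

-- ...but, as reported in this issue, delta_scan does not pick the secret up
SELECT * FROM delta_scan('abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/');
```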

jegranado commented 1 month ago

Hello,

Yes. I can read .csv files in Azure, but delta_scan fails with the message above.

I can use delta_scan with local Delta tables.

jegranado commented 1 month ago

Just had a look at the code. It looks like support for s3:// is hardcoded, so I suspect support for abfss:// would have to follow a similar path.

Looking forward to this being available.

jegranado commented 1 month ago

If I got it correctly...

static ffi::EngineBuilder* CreateBuilder would have to be changed to handle abfss:// by adding that condition to the path parser.

Then, the builder would have to set specific attributes if Azure-type secrets exist. Looking at the duckdb_azure extension codebase, at least the connection_string would have to be set up (depending on which type of Azure secret was found).

I'd love to help myself, but I fall short of C++ skills here.

nfoerster2 commented 2 weeks ago

I'm having the same problem. Please add support for Azure.

nfoerster2 commented 2 weeks ago

@jegranado I tried to hack it in, then built and loaded the extension, but it doesn't change anything. Do you have an idea?

https://github.com/nfoerster2/duckdb_delta/blob/a2ddb6c7d65cbae8e7466a4d35c315b2c34b8799/src/functions/delta_scan.cpp#L68

sid2911 commented 2 weeks ago

I'm having the same problem. It works well if I directly query a Parquet file within the Delta table folder, but it gives me the same error when I use delta_scan.

fibigerg commented 5 days ago

Hi, I tried this with a GCS path ("gs://"), and indeed it does not work, while read_parquet works with the same credentials.

However, the error is very strange.

duckdb.duckdb.IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'gs://my-bucket/my-table'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic GCS error: Error performing token request: Error after 10 retries in 6.122011584s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): error trying to connect: tcp connect error: Host is down (os error 64))
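The token-request failure above suggests the kernel's object store client falls back to the GCE metadata server instead of using the credentials DuckDB holds. For comparison, DuckDB itself can read GCS with an HMAC-key secret — a sketch with placeholder key values, assuming the bucket/table names from the error above:

```sql
-- HMAC-key secret for GCS (placeholder values)
CREATE SECRET my_gcs_secret (
    TYPE GCS,
    KEY_ID '<hmac-key-id>',
    SECRET '<hmac-secret>'
);

-- read_parquet uses the secret and works
SELECT * FROM read_parquet('gs://my-bucket/my-table/*.parquet');

-- delta_scan fails as described: the kernel's GCS client does its own auth
SELECT * FROM delta_scan('gs://my-bucket/my-table/');
```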

nfoerster2 commented 4 days ago

> If I got it correctly...
>
> static ffi::EngineBuilder* CreateBuilder would have to be changed to handle abfss:// by adding that condition to the path parser.
>
> Then, the builder would have to set specific attributes if Azure-type secrets exist. Looking at the duckdb_azure extension codebase, at least the connection_string would have to be set up (depending on which type of Azure secret was found).
>
> I'd love to help myself, but I fall short of C++ skills here.

I implemented Azure functionality, but I'm not sure it's the right way. It uses the DuckDB Azure plugin and delta-rs blob authentication; it's complicated, as they each require different Azure parameters. But it's working against Azurite, and with blob access via the CLI and via connection string/access keys.

https://github.com/nfoerster2/duckdb_delta
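To illustrate the "different Azure parameters" point: DuckDB's Azure secrets can be configured either with a connection string or with a credential chain, and a bridge to delta-rs has to translate whichever one it finds. A sketch with placeholder values:

```sql
-- Option 1: connection string / access key
CREATE SECRET az_conn (
    TYPE AZURE,
    CONNECTION_STRING '<your-connection-string>'
);

-- Option 2: credential chain, e.g. reuse an Azure CLI login
CREATE SECRET az_cli (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'cli',
    ACCOUNT_NAME '<storage-account>'
);
```

The two shapes carry different information (a full connection string vs. an account name plus an external token source), which is presumably why mapping them onto delta-rs's object store options gets complicated.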

samansmink commented 4 days ago

@nfoerster2 Thanks for that implementation. While I agree with you that this is not ideal due to the complexity of using both DuckDB's filesystem and the Kernel's internal filesystems, I think it's a good idea to merge this. The alternative would be to wait for delta-kernel-rs to support letting DuckDB fully handle all filesystem ops, but that could still take a little while, and having some Azure auth methods working seems like low-hanging fruit in the meantime.

I would propose PRing your code into the delta extension; I will add some CI jobs based on Azurite and Minio to ensure this can actually be tested. Feel free to open a PR with your code, and I can review and merge it. Otherwise I will open one with your changes and some testing later this week.

nfoerster2 commented 3 days ago

Sure, I can create a PR. I think there are still some bugs. I added two test cases which work fine, but I also tested some production data: a Delta Lake with around 1 TB of data and two layers of partitioning (serial number SN as string and YYYYMM as int, so the pattern for one file of the Delta Lake is partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet), and it fails during partition discovery. I added the output below but anonymized the data. The serial number column SN is a string, but it tries to interpret it as an int. I think more complex tests are needed.

D SELECT
      *
  FROM
      delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3/')
  WHERE SN='XYZ1234';

Invalid Input Error: Failed to cast value: Could not convert string 'ZZZZ555' to INT32

from delta_log: {"add":{"path":"SN=8XYZ1337/yyyymm=202405/part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet","partitionValues":{"SN":"8XYZ1337","yyyymm":"202405"},"size":...
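One way to check whether the partition-type problem is specific to delta_scan would be to read the same files directly with Hive partitioning, which typically infers SN as a string from the path. A diagnostic sketch only — it bypasses the Delta log entirely, so it may also read files the log has removed (path and value taken from the anonymized example above):

```sql
SELECT *
FROM read_parquet(
    'az://deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3/*/*/*.parquet',
    hive_partitioning = true
)
WHERE SN = 'XYZ1234';
```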

Then, if I just go by the second level, it takes a huge amount of time; I think it does not push the predicate down to the partitions correctly:

SELECT
      SN
  FROM
      delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_interval_5m/')
  where yyyymm=202212;
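If it helps with debugging, prefixing the query with EXPLAIN (or EXPLAIN ANALYZE) shows whether the yyyymm filter is applied inside the scan operator or in a separate filter above it, which would confirm whether the partition predicate is being pushed down:

```sql
EXPLAIN
SELECT SN
FROM delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_interval_5m/')
WHERE yyyymm = 202212;
```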

samansmink commented 1 day ago

PR is up: https://github.com/duckdb/duckdb_delta/pull/39

mrjsj commented 23 hours ago

Very nice, @samansmink. Been looking forward to this. 😊