Modified delta_scan.cpp to add secret discovery and builder setup for Azure. At the moment, different builder settings are set because delta-rs and duckdb_azure read different values from the builder/env.
Added duckdb_azure as an extension requirement to duckdb_delta
Added two test cases using Azurite
What I tested on top (not checked in):
Accessing Blob Storage via the CLI
Query tests on a larger Delta Lake table (around 1 TB) with two levels of partitioning
Limitations/Bugs:
A Delta Lake table with around 1 TB of data and two levels of partitioning (serial number SN as a string and yyyymm as an int, so the path pattern for one file is partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet) fails during partition discovery. The query and error are included below, with the data anonymized. The SN column is a string, but the scan tries to interpret it as an int. I think more complex tests are needed.
D SELECT
*
FROM
delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3/')
WHERE SN='XYZ1234';
Invalid Input Error: Failed to cast value: Could not convert string 'ZZZZ555' to INT32
from delta_log:
{"add":{"path":"SN=8XYZ1337/yyyymm=202405/part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet","partitionValues":{"SN":"8XYZ1337","yyyymm":"202405"},"size":...
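To make the expected behaviour concrete, here is a minimal sketch in Python (not the extension's actual C++ code) of casting partition values by the column types declared in the table schema instead of guessing from the value. The schema mapping, the `cast_partition_values` helper, and the abbreviated add action are assumptions for illustration; the action keeps only the `path` and `partitionValues` fields from the log line above.

```python
import json

# Assumed table schema for illustration: SN is a string, yyyymm an integer.
PARTITION_SCHEMA = {"SN": str, "yyyymm": int}

# Abbreviated add action: only the fields needed here, taken from the log line.
ADD_ACTION = json.dumps({
    "add": {
        "path": "SN=8XYZ1337/yyyymm=202405/"
                "part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet",
        "partitionValues": {"SN": "8XYZ1337", "yyyymm": "202405"},
    }
})


def cast_partition_values(action_json, schema):
    """Cast each partition value using the type declared for its column."""
    raw = json.loads(action_json)["add"]["partitionValues"]
    typed = {}
    for column, value in raw.items():
        if column not in schema:
            raise KeyError(f"partition column {column!r} missing from schema")
        # The log stores partition values as strings (or null); cast by schema.
        typed[column] = None if value is None else schema[column](value)
    return typed


typed_values = cast_partition_values(ADD_ACTION, PARTITION_SCHEMA)
print(typed_values)
```

With the schema driving the cast, SN stays a string and only yyyymm is converted to an integer, so a value such as 'ZZZZ555' would never be sent through an integer conversion.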
If I then filter only on the second partition level, the query takes a very long time. I think the predicate is not pushed down to the partitions correctly:
SELECT
SN
FROM
delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_interval_5m/')
where yyyymm=202212;
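For reference, partition pruning for a query like this would ideally behave like the following Python sketch: the predicate on `yyyymm` is checked against each file's partition values from the log, so only matching files are opened. The file entries and the `prune_files` helper are hypothetical; this illustrates the technique and is not how delta_scan is implemented.

```python
# Hypothetical file entries, shaped like the add actions in the delta log.
FILES = [
    {"path": "SN=AAA0001/yyyymm=202211/a.parquet",
     "partitionValues": {"SN": "AAA0001", "yyyymm": "202211"}},
    {"path": "SN=AAA0001/yyyymm=202212/b.parquet",
     "partitionValues": {"SN": "AAA0001", "yyyymm": "202212"}},
    {"path": "SN=BBB0002/yyyymm=202212/c.parquet",
     "partitionValues": {"SN": "BBB0002", "yyyymm": "202212"}},
    {"path": "SN=BBB0002/yyyymm=202301/d.parquet",
     "partitionValues": {"SN": "BBB0002", "yyyymm": "202301"}},
]


def prune_files(files, column, wanted, cast=int):
    """Keep only the paths whose partition value for `column` equals `wanted`."""
    kept = []
    for entry in files:
        value = entry["partitionValues"].get(column)
        # Compare after casting, since the log stores partition values as strings.
        if value is not None and cast(value) == wanted:
            kept.append(entry["path"])
    return kept


selected = prune_files(FILES, "yyyymm", 202212)
print(selected)
```

Only the two files under yyyymm=202212 are selected, regardless of their SN directory. A scan that skips this step has to open every file in the table, which would be consistent with the slow query above.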
Thanks for the PR @nfoerster2! I've tinkered with it a little bit and decided to open another PR based on yours that fixes a few things and adds CI for this.