duckdb / duckdb_delta

DuckDB extension for Delta Lake
MIT License

Base functionality to use duckdb_delta on Azure Storage #37

Closed nfoerster2 closed 1 day ago

nfoerster2 commented 3 days ago

What I did:

What I tested on top (not checked in):

Limitations/Bugs: I tested against a Delta Lake table with around 1 TB of data and two levels of partitioning: the serial number SN (string) and yyyymm (int). The path of a single file in the table follows the pattern partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet. The scan fails during partition discovery. The query and error are shown below, with the data anonymized. The SN column is a string, but the extension tries to interpret it as an int. I think more complex tests are needed.

D SELECT
      *
  FROM
      delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3/')
  WHERE SN='XYZ1234';

Invalid Input Error: Failed to cast value: Could not convert string 'ZZZZ555' to INT32

from delta_log: {"add":{"path":"SN=8XYZ1337/yyyymm=202405/part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet","partitionValues":{"SN":"8XYZ1337","yyyymm":"202405"},"size":...
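
For illustration only, a minimal Python sketch of how partition values from an add action could be typed. The Delta log stores every partition value as a string, so the reader has to cast each one according to the column type declared in the table schema, not by guessing from the value. The PARTITION_TYPES mapping, the CASTS table, and the typed_partition_values helper are hypothetical names made up for this sketch, and the log line is trimmed to the fields visible above; none of this is the extension's actual code.

```python
import json

# Trimmed copy of the add action above: only path and partitionValues are kept.
ADD_LINE = (
    '{"add":{"path":"SN=8XYZ1337/yyyymm=202405/'
    'part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet",'
    '"partitionValues":{"SN":"8XYZ1337","yyyymm":"202405"}}}'
)

# Assumed partition column types, as described in this comment (SN string, yyyymm int).
PARTITION_TYPES = {"SN": "string", "yyyymm": "integer"}

# Hypothetical mapping from schema type name to a Python cast.
CASTS = {"string": str, "integer": int}


def typed_partition_values(add_line, partition_types):
    """Cast the string partition values of one add action using the schema types."""
    action = json.loads(add_line)["add"]
    typed = {}
    for column, raw in action["partitionValues"].items():
        cast = CASTS[partition_types[column]]
        # A JSON null partition value stays None instead of being cast.
        typed[column] = None if raw is None else cast(raw)
    return typed


print(typed_partition_values(ADD_LINE, PARTITION_TYPES))
```

With the schema-driven cast, SN stays a string and only yyyymm is converted to an integer. Declaring SN as "integer" in this sketch raises a ValueError, which mirrors the failed INT32 cast reported above.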

If I then filter only on the second partition level, the query takes a very long time. I think the predicate is not pushed down to the partitions correctly:

SELECT
      SN
  FROM
      delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_interval_5m/')
  where yyyymm=202212;
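
To make the expectation concrete, here is a rough Python sketch of what I mean by pushing the predicate down to the partitions: the add actions are filtered by their partition values first, so only matching files would be opened at all. The file paths, the ADD_ACTIONS list, and the prune_files helper are hypothetical and only illustrate the idea; this is not how the extension is implemented.

```python
# Hypothetical add actions with partition values already cast to their schema types.
ADD_ACTIONS = [
    {"path": "SN=AAA111/yyyymm=202211/a.parquet",
     "partitionValues": {"SN": "AAA111", "yyyymm": 202211}},
    {"path": "SN=AAA111/yyyymm=202212/b.parquet",
     "partitionValues": {"SN": "AAA111", "yyyymm": 202212}},
    {"path": "SN=BBB222/yyyymm=202212/c.parquet",
     "partitionValues": {"SN": "BBB222", "yyyymm": 202212}},
]


def prune_files(add_actions, column, value):
    """Return the paths of files whose partition value for `column` equals `value`."""
    return [
        action["path"]
        for action in add_actions
        if action["partitionValues"].get(column) == value
    ]


# Only the two files in the yyyymm=202212 partitions remain to be read.
print(prune_files(ADD_ACTIONS, "yyyymm", 202212))
```

With pruning like this, a filter on yyyymm alone should only touch the files under the matching yyyymm directories, regardless of how many SN partitions exist.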
samansmink commented 1 day ago

Thanks for the PR @nfoerster2! I've tinkered with it a little bit and decided to open another PR based on yours that fixes a few things and adds CI for this.