Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.82k stars 113 forks

Support partition evolution (old files having different partitioning schemes vs new files) #2249

Open jaychia opened 1 month ago

jaychia commented 1 month ago

Is your feature request related to a problem? Please describe.

Currently, Daft assumes that all files retrieved from a given Iceberg table have the same partitioning:

  1. Retrieve current partition spec from the table
  2. Translate any predicates into partition filters (e.g. dt > 1970-02-01 becomes day(dt) > 30)
  3. Apply this partition filter naively to any ScanTasks
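To make the failure mode concrete, here is a minimal sketch of the three steps above in plain Python. It is not Daft's implementation: `day_ordinal` and `naive_prune` are hypothetical names, partition values are assumed to be zero-based ordinals counted from 1970-01-01, and the bound is kept inclusive so that no matching file is pruned.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def day_ordinal(d: date) -> int:
    """Days since 1970-01-01 (assumed zero-based ordinal for day(dt))."""
    return (d - EPOCH).days


def naive_prune(partition_values: list[int], lower: date) -> list[int]:
    """Keep partition values that may hold rows with dt > lower.

    Assumption baked in (this is the problem): every value is treated
    as a day ordinal, because only the current spec is consulted.
    """
    bound = day_ordinal(lower)
    # Inclusive bound: rows later on the boundary day may still match.
    return [v for v in partition_values if v >= bound]


# Files written under the current day(dt) spec are pruned as expected.
kept_new = naive_prune([0, 30, 31, 40], date(1970, 2, 1))

# A file written under an older month(dt) spec stores a month ordinal
# (1 for February 1970). Compared against the day bound, it is dropped
# even though it may contain matching rows.
kept_old = naive_prune([1], date(1970, 2, 1))
```

In this sketch `kept_old` comes back empty, which is the incorrect pruning described in this issue.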

However, in certain cases, the partitioning of old data might differ from the current partitioning spec through the process of "partition evolution". For example, if the partitioning used to be month(dt), then the predicate above should be translated to day(dt) > 30 for new files, but month(dt) > 1 for old files.
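One possible shape for the fix is to translate the predicate once per partition spec and apply each translation only to the files written under that spec. The sketch below is an illustration under stated assumptions, not Daft's or Iceberg's API: `FileEntry`, `ordinal`, and `prune` are hypothetical names, each file is assumed to carry the transform it was written with, and ordinals are zero-based counts from 1970-01-01 with an inclusive bound. The exact constants therefore depend on that convention and can differ from the informal ones in the example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a data file plus the spec it was written under."""
    path: str
    transform: str        # "day" or "month"
    partition_value: int  # ordinal produced by that transform


def ordinal(transform: str, d: date) -> int:
    """Zero-based ordinal of d under the given transform (assumed convention)."""
    if transform == "day":
        return (d - date(1970, 1, 1)).days
    if transform == "month":
        return (d.year - 1970) * 12 + (d.month - 1)
    raise ValueError(f"unsupported transform: {transform}")


def prune(files: list[FileEntry], lower: date) -> list[FileEntry]:
    """Keep files that may hold rows with dt > lower.

    The bound is recomputed with each file's own transform, and kept
    inclusive so boundary partitions are never dropped.
    """
    return [f for f in files if f.partition_value >= ordinal(f.transform, lower)]


files = [
    FileEntry("old-1970-01.parquet", "month", 0),
    FileEntry("old-1970-02.parquet", "month", 1),
    FileEntry("new-1970-01-31.parquet", "day", 30),
    FileEntry("new-1970-02-01.parquet", "day", 31),
]
kept = [f.path for f in prune(files, date(1970, 2, 1))]
```

Here the old February file survives because its value is compared against a month bound rather than a day bound.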

See: #2084 for tests

samster25 commented 1 month ago

@jaychia can you merge in the tests behind a pytest skip? I'll take a look after that!

jaychia commented 1 month ago

> @jaychia can you merge in the tests behind a pytest skip? I'll take a look after that!

Sounds good, pending merge: https://github.com/Eventual-Inc/Daft/pull/2084
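For reference, merging a test "behind a pytest skip" could look like the sketch below. The test name and body are placeholders and are not taken from #2084; only `pytest.mark.skip` and its `reason` argument are real pytest API.

```python
import pytest


# Hypothetical test name; the real tests live in the PR linked above.
@pytest.mark.skip(reason="partition evolution not supported yet, see #2249")
def test_filter_with_evolved_partition_spec():
    # Placeholder body: pytest reports the test as skipped and never
    # executes this, so it can be merged before the feature lands.
    raise NotImplementedError
```

Removing the decorator once the feature is implemented turns the test back on.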