dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Enable schema evolution for `merge` write disposition with `delta` table format #1742

Closed jorritsandbrink closed 1 month ago

jorritsandbrink commented 1 month ago

Description

Related Issues

Fixes #1739

jorritsandbrink commented 1 month ago

@rudolfix

> If `arrow_ds` is empty, you do not evolve the schema. IMO that should happen. Please add a test for it (`if arrow_ds.head(1).num_rows == 0:`).

Done.

> Should we update all table schemas like in other destinations, where it happens in `update_stored_schema`? If you agree, let's create a ticket for that.

Three options:

  1. We let delta-rs do automatic schema evolution.
    • This already works for `write_deltalake` (which we use for the `append` and `replace` write dispositions).
    • This does not yet work for `DeltaTable.merge` (which we use for the `merge` write disposition).
    • This does not yet work for the "empty batch case".
  2. We manually manage schema evolution.
    • In this case I think using `update_stored_schema` is a good idea.
  3. Mix of 1 and 2.

We currently do option 3. Option 1 is not possible yet, but might become possible once the linked tickets are done (they are already assigned, so that could be soon). Option 2 is possible, but is a bigger burden on our side. Which do you prefer?
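
As a rough sketch of what the mixed approach amounts to, the function below picks a schema evolution path per load, following the bullets above. The function name and the `"delta-rs"` / `"manual"` labels are hypothetical and only serve the illustration; this is not the actual dlt implementation.

```python
def schema_evolution_path(write_disposition: str, is_empty_batch: bool) -> str:
    """Return which side evolves the schema: "delta-rs" or "manual"."""
    if is_empty_batch:
        # Automatic evolution does not yet cover the "empty batch case".
        return "manual"
    if write_disposition in ("append", "replace"):
        # These dispositions use write_deltalake, where automatic evolution works.
        return "delta-rs"
    if write_disposition == "merge":
        # DeltaTable.merge does not yet evolve the schema automatically.
        return "manual"
    raise ValueError(f"unknown write disposition: {write_disposition!r}")


paths = {
    disposition: schema_evolution_path(disposition, is_empty_batch=False)
    for disposition in ("append", "replace", "merge")
}
```

If the linked delta-rs tickets land, every branch here could in principle return `"delta-rs"`, which would correspond to option 1.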

> Same thing for truncating tables before the load. This is actually used by the `refresh` option.

Okay, then we should probably use it.

rudolfix commented 1 month ago

@jorritsandbrink

So what I'd do:

in `update_stored_schema`

  1. Make sure that table prefix == table dir for delta, so that we know each table has a separate "folder". The folder structure is then good and weird layouts are eliminated.
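
One way to read the "table prefix == table dir" condition is sketched below. This is an assumption-laden illustration: it assumes a filesystem layout string containing a `{table_name}` placeholder, and the helpers `table_prefix` and `has_separate_table_folder` are hypothetical names, not dlt APIs.

```python
import posixpath


def table_prefix(layout: str, table_name: str) -> str:
    # Everything up to the first placeholder that follows {table_name}.
    if "{table_name}" not in layout:
        raise ValueError("layout must contain the {table_name} placeholder")
    before, _, after = layout.partition("{table_name}")
    literal_tail = after.split("{", 1)[0]
    return before + table_name + literal_tail


def has_separate_table_folder(layout: str, table_name: str) -> bool:
    # The prefix must end exactly at a directory boundary, so all files of
    # a table live in one folder that no other table shares.
    prefix = table_prefix(layout, table_name)
    return prefix.endswith("/") and posixpath.dirname(prefix) == prefix.rstrip("/")


# Layout strings below are examples chosen for the sketch.
good = has_separate_table_folder("{table_name}/{load_id}.{file_id}.{ext}", "events")
weird = has_separate_table_folder("{table_name}.{load_id}.{file_id}.{ext}", "events")
```

With the first layout the prefix `events/` is also the table directory; with the second, the prefix `events.` is only a file name prefix, which is the kind of layout the check would reject.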

in truncate / drop tables

  1. Disable for delta (i.e. with an error message), OR
  2. Implement dropping and truncating tables properly (all refresh options should work after that).
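
The first of these two choices could look roughly like the guard below. It is a hypothetical sketch: `truncate_tables` and the `table_formats` mapping are invented for the example and do not mirror dlt's job client interface.

```python
def truncate_tables(tables: list, table_formats: dict) -> list:
    """Return the tables that may be truncated, refusing delta tables."""
    delta_tables = [name for name in tables if table_formats.get(name) == "delta"]
    if delta_tables:
        # Fail loudly instead of silently leaving stale data behind.
        raise NotImplementedError(
            "Truncating tables is not supported for the delta table format: "
            + ", ".join(delta_tables)
        )
    return list(tables)


# Hypothetical table names and formats.
formats = {"events": "delta", "users": "parquet"}
allowed = truncate_tables(["users"], formats)
```

Failing with an explicit message keeps `refresh` from appearing to work on delta tables until dropping and truncating are implemented properly.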

migrating schema

You already have all the building blocks for (2), and IMO it makes sense to migrate tables before we start loading, but the priority is low.