Enable schema evolution for `merge` write disposition with `delta` table format

jorritsandbrink commented 1 month ago

Description

- Enables schema evolution (adding new columns) for the `merge` write disposition with the `delta` table format.
- Increases the minimum `deltalake` version to get access to the `add_columns` method.
- Allows passing a schema name to `get_delta_tables` for pipelines with multiple schemas (this may explain the problem with "missing" delta tables).
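
To illustrate the first point: before a merge, the columns that exist in the incoming data but not yet in the table have to be identified so they can be added. The sketch below does this with plain dictionaries. The name `columns_to_add` and the dict-based schema representation are assumptions made for illustration; they are not the PR's code, which per the description relies on deltalake's `add_columns`.

```python
from typing import Dict, List, Tuple


def columns_to_add(
    table_schema: Dict[str, str], incoming_schema: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (name, type) pairs present in the incoming data but not in the table.

    Hypothetical helper: schemas are modelled as {column_name: type_name} dicts.
    Only additions are detected; type changes and dropped columns are ignored.
    """
    return [
        (name, dtype)
        for name, dtype in incoming_schema.items()
        if name not in table_schema
    ]


existing = {"id": "bigint", "name": "text"}
incoming = {"id": "bigint", "name": "text", "email": "text"}
new_columns = columns_to_add(existing, incoming)
print(new_columns)  # [('email', 'text')]
```

Only additive changes are covered here, which matches the scope stated above (adding new columns).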

Related Issues

Fixes #1739

jorritsandbrink commented 1 month ago

@rudolfix

> If `arrow_ds` is empty you do not evolve the schema. IMO that should happen. Please add a test for it (`if arrow_ds.head(1).num_rows == 0:`).

Done.
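
The ordering this implies can be sketched as follows: schema evolution runs first, and the emptiness check only decides whether the merge itself is skipped. `load_batch`, `evolve_schema`, `merge_rows`, and `add_missing` are hypothetical names used for illustration, not functions from dlt or deltalake.

```python
from typing import Callable, Dict, List


def load_batch(
    rows: List[Dict[str, object]],
    batch_columns: List[str],
    evolve_schema: Callable[[List[str]], None],
    merge_rows: Callable[[List[Dict[str, object]]], None],
) -> bool:
    """Evolve the table schema, then merge only if the batch has rows.

    Returns True if a merge was executed, False if the batch was empty.
    """
    # Evolve first, so new columns are registered even for an empty batch.
    evolve_schema(batch_columns)
    if not rows:
        return False
    merge_rows(rows)
    return True


table_columns = ["id"]
merged: List[Dict[str, object]] = []


def add_missing(columns: List[str]) -> None:
    # Stand-in for the real schema evolution step (hypothetical).
    missing = [c for c in columns if c not in table_columns]
    table_columns.extend(missing)


did_merge = load_batch([], ["id", "email"], add_missing, merged.extend)
print(did_merge, table_columns)  # False ['id', 'email']
```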

> Should we update all table schemas like in other destinations, where it happens in `update_stored_schema`? If you agree, let's create a ticket for that.

Three options:

1. We let delta-rs do automatic schema evolution.
   - This already works for `write_deltalake` (which we use for the `append` and `replace` dispositions).
   - This does not yet work for `DeltaTable.merge` (which we use for the `merge` write disposition).
   - This does not yet work for the "empty batch case".
2. We manually manage schema evolution.
   - In this case I think using `update_stored_schema` is a good idea.
3. Mix of 1 and 2.

We currently do 3. Option 1 is not possible yet, but might become possible once the linked tickets are done (they are already assigned, so that could be soon). Option 2 is possible, but is a bigger burden on our side. Which do you prefer?
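
The mix currently in place could be expressed as a small decision function. This is only a sketch of the behaviour described in this comment; `needs_manual_evolution` is a hypothetical name, and the rules simply encode the three sub-points listed under the first option.

```python
def needs_manual_evolution(write_disposition: str, batch_is_empty: bool) -> bool:
    """Decide whether schema evolution has to be handled manually.

    Encodes the situation described in this thread (an assumption about
    delta-rs behaviour at the time of writing, not a guarantee):
    - append/replace go through write_deltalake, which evolves automatically
    - merge goes through DeltaTable.merge, which does not
    - an empty batch is not evolved automatically either
    """
    if batch_is_empty:
        return True
    return write_disposition == "merge"


for disposition in ("append", "replace", "merge"):
    print(disposition, needs_manual_evolution(disposition, batch_is_empty=False))
```

If delta-rs gains automatic evolution for merges and empty batches, a function like this would collapse to always returning `False`, which corresponds to the first option.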

> Same thing for truncating tables before the load. This is actually used by the refresh option.

Okay, then we should probably use it.

rudolfix commented 1 month ago

@jorritsandbrink

So what I'd do:

- In `update_stored_schema`:
  - Make sure that table prefix == table dir for delta (then we know that each table has a separate "folder"). Then the folder structure is good and weird layouts are eliminated.
- In truncate / drop tables:
  - Disable for delta (i.e. with an error message), OR
  - Implement dropping and truncating tables properly (all refresh options should work after that).
- Migrating schema: you already have all the building blocks for (2), and IMO it makes sense to migrate tables before we start loading, but the priority is low.
