delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

schema merging doesn't work when overwriting with a predicate #2567

Closed polivbr closed 5 days ago

polivbr commented 3 weeks ago

Environment

Delta-rs version:

0.17.4

Binding:

Python


Bug

What happened:

I attempted to update a table from a Polars DataFrame with mode="overwrite" and a predicate to use for replacement. The DataFrame had a subset of the columns that are in the table. While the rows matching the predicate are successfully replaced with the new data, the table's schema becomes the schema of the DataFrame, rather than being merged with the existing schema.

What you expected to happen:

The original table schema is preserved.
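
To make that expectation concrete, here is a minimal sketch of the merge semantics I had in mind. It is an illustration only and not delta-rs's implementation: the schemas are modelled as plain name-to-type dictionaries, and `merge_schemas` is a hypothetical helper, not a deltalake API.

```python
# Hypothetical illustration: schemas are plain dicts mapping column name to a
# type string, not deltalake Schema objects.

def merge_schemas(table_schema: dict[str, str], incoming_schema: dict[str, str]) -> dict[str, str]:
    """Return the union of both schemas, keeping the existing table columns first."""
    merged = dict(table_schema)
    for name, dtype in incoming_schema.items():
        # A column present on both sides must agree on its type.
        if name in merged and merged[name] != dtype:
            raise TypeError(f"type conflict on column {name!r}: {merged[name]} vs {dtype}")
        # Columns only present in the incoming data are appended.
        merged.setdefault(name, dtype)
    return merged


# The table has columns a, b, c; the DataFrame being written only has a and b.
table_schema = {"a": "long", "b": "long", "c": "long"}
incoming_schema = {"a": "long", "b": "long"}

expected = merge_schemas(table_schema, incoming_schema)
print(list(expected))  # ['a', 'b', 'c']
```

Under this reading, writing a DataFrame that lacks column c should leave c in the table schema, with the newly written rows holding nulls for it.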

How to reproduce it:

1) Create a table with a set of columns
2) Write to that same table with:

ion-elgreco commented 3 weeks ago

@polivbr please create a reproducible example

polivbr commented 3 weeks ago

Here you go:

import polars as pl
import deltalake as dl

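# Initial table with three columns: a, b, c
df = pl.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 1, 2, 2], 'c': [10, 11, 12, 13]})

df.write_delta("test_table")

# Replacement rows contain only a subset of the table's columns (no c)
df2 = pl.DataFrame({'a': [100, 200, 300], 'b': [1, 1, 1]})

# Overwrite only the rows matching the predicate, asking for a schema merge
df2.write_delta(
    "test_table",
    mode="overwrite",
    delta_write_options={
        "predicate": "b = 1",
        "schema_mode": "merge",
        "engine": "rust"
    }
)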

table = dl.DeltaTable("test_table")
schema = table.schema()

print(schema)

# OUTPUT:
# Schema([Field(a, PrimitiveType("long"), nullable=True), Field(b, PrimitiveType("long"), nullable=True)])
#
# Note that Field c is absent