delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Add replaceWhere functionality #1957

Closed MrPowers closed 6 months ago

MrPowers commented 10 months ago

Description

PySpark has a cool replaceWhere option that lets you overwrite the existing data in a Delta table that matches a predicate with new data. Here's an example of the replaceWhere functionality:

df2 = spark.createDataFrame(
    [
        ("x", 7),
        ("y", 8),
        ("z", 9),
    ]
).toDF("letter", "number")

(
    df2.write.format("delta")
    .option("replaceWhere", "number >= 2")
    .mode("overwrite")
    .save("tmp/my_data")
)

What do folks think about adding replaceWhere functionality to Python deltalake?

It's possible that the Rust predicate argument in write_deltalake already exposes this functionality.
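
To make the intended semantics concrete, here is a minimal pure-Python sketch of what a replaceWhere overwrite does: existing rows that match the predicate are dropped, and the incoming rows take their place. It uses plain lists of dicts rather than Spark or deltalake. The function name replace_where and the starting table contents are hypothetical, chosen only for illustration. As far as I understand, Delta Lake on Spark also rejects incoming rows that do not satisfy the predicate by default; the sketch assumes that behaviour and mimics it with a ValueError.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def replace_where(
    existing: List[Row],
    incoming: List[Row],
    predicate: Callable[[Row], bool],
) -> List[Row]:
    """Hypothetical helper: overwrite only the rows that match `predicate`."""
    # Assumed constraint check: every incoming row must satisfy the predicate,
    # otherwise the write would put data outside the region being replaced.
    offending = [row for row in incoming if not predicate(row)]
    if offending:
        raise ValueError(f"rows do not match the predicate: {offending}")

    # Keep the rows the predicate does not touch, then add the new rows.
    kept = [row for row in existing if not predicate(row)]
    return kept + incoming


# Illustrative starting table (not taken from the issue).
existing = [
    {"letter": "a", "number": 1},
    {"letter": "b", "number": 2},
    {"letter": "c", "number": 3},
]
# Same rows as df2 in the PySpark example.
incoming = [
    {"letter": "x", "number": 7},
    {"letter": "y", "number": 8},
    {"letter": "z", "number": 9},
]

result = replace_where(existing, incoming, lambda row: row["number"] >= 2)
for row in result:
    print(row)
```

Only the row with number 1 survives from the original table, because it is the only one that falls outside the predicate.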

ion-elgreco commented 10 months ago

I exposed the predicate parameter for the Rust engine writer, but it currently does nothing because the functionality is not built in Rust yet.

r3stl355 commented 10 months ago

take

r3stl355 commented 10 months ago

I'll give this a try

r3stl355 commented 10 months ago

WriteBuilder uses predicate: Option<String> but has no implementation for it yet, whereas DeleteBuilder uses predicate: Option<Expression>. I suggest harmonising by changing WriteBuilder to use predicate: Option<Expression>. Though this is a breaking change, predicate handling is not implemented in WriteBuilder, so changing the type should not cause issues.

roeap commented 10 months ago

It would be great to do this using logical expressions rather than the physical ones - much like @Blajda recently updated for merge. The good thing there is we get some type coercion for free, which has been a hassle with expressions.

In Python we will likely have to accept strings and do the parsing.
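
As a rough illustration of "accept strings and do the parsing" together with type coercion, here is a toy sketch in plain Python. It is not how DataFusion parses or coerces expressions; it only handles the shape "column op literal", and the names parse_predicate and schema are hypothetical. The point is that the literal is coerced to the column's declared type once, at parse time, instead of being compared as a raw string.

```python
import operator
import re
from typing import Any, Callable, Dict

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Two-character operators are listed first so ">=" is not read as ">".
_PATTERN = re.compile(r"^\s*(\w+)\s*(>=|<=|!=|=|>|<)\s*(.+?)\s*$")


def parse_predicate(
    text: str, schema: Dict[str, type]
) -> Callable[[Dict[str, Any]], bool]:
    """Hypothetical helper: turn 'column op literal' into a row filter."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported predicate: {text!r}")
    column, op, literal = match.groups()
    if column not in schema:
        raise KeyError(f"unknown column: {column!r}")

    # Coerce the literal to the column type taken from the schema.
    value = schema[column](literal.strip("'\""))
    compare = _OPS[op]
    return lambda row: compare(row[column], value)


schema = {"letter": str, "number": int}
matches = parse_predicate("number >= 2", schema)
print(matches({"letter": "x", "number": 7}))
print(matches({"letter": "a", "number": 1}))
```

A real implementation would hand the string to a proper SQL expression parser and let the logical planner do the coercion against the table schema.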

ion-elgreco commented 10 months ago

@roeap I think we can start allowing arrow expressions as input, which we can serialize as Substrait and then deserialize with datafusion-substrait.

roeap commented 10 months ago

This would be a great goal, but I would say let's be consistent in that and make a deliberate API choice.

I.e., not have Substrait supported in one method but not the other...

Good news is substrait plans are of course logical plans :)

r3stl355 commented 10 months ago

I'll try that, @roeap. As for

"It would be great to do this using logical expressions rather than the physical ones - much like @Blajda recently updated for merge."

is this David's PR that you are referring to? https://github.com/delta-io/delta-rs/pull/1969

ion-elgreco commented 10 months ago

@roeap we should be able to add this to merge, update, delete and write, and then just add the conversion inside the pyo3 binding, so it's a Python-only feature.

roeap commented 10 months ago

@r3stl355 it's #1720, which had been up for a while before it got merged.

@ion-elgreco - sure, to get started, and as you said, right now this could just be internal. Substrait is a nice feature for Rust as well, of course as an alternative path, since we are looking to integrate into DataFusion's internal planning.