delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.14k stars 379 forks source link

TableMerger - when_matched_delete() fails when Column names contain special characters #2438

Closed mkp-jansen closed 4 months ago

mkp-jansen commented 4 months ago

Environment

Delta-rs version: 0.16.4

Binding: python

Environment:


Bug

What happened: At our company we are currently thinking about using delta for our sensor data. The package delta-rs provides pretty much all the functionality we need. However, for reasons I won't be able to change we often have column names with two dashes, e.g. "y--1" ("y-1" works). We need to be able to delete data from the delta lake. When using the TableMerger this fails as shown in the example below.

In the documentation it says the following:

Column names with special characters, such as numbers or spaces should be encapsulated in backticks: "target.123column" or "target.my column"

However, there is no argument in "when_matched_delete()" to specifiy the columns with special characters.

What you expected to happen:

I guess the desired behaviour would be that you can simply delete the matching rows, even when the column names contain special characters.

I would be happy to give a fix a shot (also in rust) - but I would need some guidance along the way.

How to reproduce it:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

data = pa.table({"x": [1, 2, 3], "y--1": [4, 5, 6]})
write_deltalake("tmp", data)
dt = DeltaTable("tmp")
new_data = pa.table({"x": [2, 3]})

(
    dt.merge(
        source=new_data,
        predicate='target.x = source.x',
        source_alias='source',
        target_alias='target')
    .when_matched_delete()
    .execute()
)
mkp-jansen commented 4 months ago

I forgot to add the error message:


---------------------------------------------------------------------------
DeltaError                                Traceback (most recent call last)
Cell In[6], line 16
      6 dt = DeltaTable("tmp")
      7 new_data = pa.table({"x": [2, 3]})
      9 (
     10     dt.merge(
     11         source=new_data,
     12         predicate='target.x = source.x',
     13         source_alias='source',
     14         target_alias='target')
     15     .when_matched_delete()
---> 16     .execute()
     17 )

File c:\...\.venv\Lib\site-packages\deltalake\table.py:1778, in TableMerger.execute(self)
   1772 def execute(self) -> Dict[str, Any]:
   1773     """Executes `MERGE` with the previously provided settings in Rust with Apache Datafusion query engine.
   1774 
   1775     Returns:
   1776         Dict: metrics
   1777     """
-> 1778     metrics = self.table._table.merge_execute(
   1779         source=self.source,
   1780         predicate=self.predicate,
   1781         source_alias=self.source_alias,
   1782         target_alias=self.target_alias,
   1783         safe_cast=self.safe_cast,
   1784         writer_properties=self.writer_properties._to_dict()
   1785         if self.writer_properties
   1786         else None,
   1787         custom_metadata=self.custom_metadata,
   1788         matched_update_updates=self.matched_update_updates,
   1789         matched_update_predicate=self.matched_update_predicate,
   1790         matched_delete_predicate=self.matched_delete_predicate,
   1791         matched_delete_all=self.matched_delete_all,
   1792         not_matched_insert_updates=self.not_matched_insert_updates,
   1793         not_matched_insert_predicate=self.not_matched_insert_predicate,
   1794         not_matched_by_source_update_updates=self.not_matched_by_source_update_updates,
   1795         not_matched_by_source_update_predicate=self.not_matched_by_source_update_predicate,
   1796         not_matched_by_source_delete_predicate=self.not_matched_by_source_delete_predicate,
   1797         not_matched_by_source_delete_all=self.not_matched_by_source_delete_all,
   1798     )
   1799     self.table.update_incremental()
   1800     return json.loads(metrics)

DeltaError: Generic DeltaTable error: Schema error: No field named __delta_rs_c_y. Valid fields are source.x, __delta_rs_source, target.x, target."y--1", target.__delta_rs_path, __delta_rs_target, __delta_rs_operation, __delta_rs_c_x, "__delta_rs_c_y--1", __delta_rs_delete, __delta_rs_target_insert, __delta_rs_target_update, __delta_rs_target_delete, __delta_rs_target_copy.
mkp-jansen commented 4 months ago

Wow - that was fast! Thanks