delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Unable to specify columns with a dot in the name in predicate #2624

Open emanueledomingo opened 3 days ago

emanueledomingo commented 3 days ago

Environment

Delta-rs version:

How do I find the delta-rs version as a Python user?

Binding: 0.18.1
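
A minimal sketch of one way to read the installed Python binding version, using only the standard library's package metadata. The helper name `installed_version` is hypothetical, and this reports the version of the `deltalake` Python package, which is not necessarily the version of the underlying Rust crate.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the binding version when deltalake is installed, otherwise None.
print(installed_version("deltalake"))
```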

Environment:


Bug

What happened: I cannot use a predicate that references a column with a dot in its name, such as " \"Product.Id\" = '1' ", when writing with the Rust engine. The quoted name is interpreted as "Product"."Id" (column "Id" of table "Product") instead of the single column "Product.Id".

What you expected to happen: The quoted column name containing the dot is parsed as a single identifier.

How to reproduce it:

import deltalake
import pyarrow as pa

ta = pa.Table.from_pydict(
    {
        "Product.Id": ['x-0', 'x-1', 'x-2', 'x-3'],
    }
)

fp = "./resources/path/to/table"

deltalake.write_deltalake(
    table_or_uri=fp,
    data=ta,
    partition_by=["Product.Id"],
    engine="rust",
    mode="overwrite",
    predicate="\"Product.Id\" = 'x-1'"
)

More details:

Here is the stack trace:

DeltaError                                Traceback (most recent call last)
Cell In[89], line 12
      4 ta = pa.Table.from_pydict(
      5     {
      6         "Product.Id": ['x-0', 'x-1', 'x-2', 'x-3'],
      7     }
      8 )
     10 fp = "./resources/path/to/table"
---> 12 deltalake.write_deltalake(
     13     table_or_uri=fp,
     14     data=ta,
     15     partition_by=["Product.Id"],
     16     engine="rust",
     17     mode="overwrite",
     18     predicate="\"Product.Id\" = 'x-1'"
     19 )

File ~/mambaforge/envs/delta/lib/python3.12/site-packages/deltalake/writer.py:304, in write_deltalake(table_or_uri, data, schema, partition_by, mode, file_options, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, schema_mode, storage_options, partition_filters, predicate, large_dtypes, engine, writer_properties, custom_metadata)
    301     return
    303 data = RecordBatchReader.from_batches(schema, (batch for batch in data))
--> 304 write_deltalake_rust(
    305     table_uri=table_uri,
    306     data=data,
    307     partition_by=partition_by,
    308     mode=mode,
    309     table=table._table if table is not None else None,
    310     schema_mode=schema_mode,
    311     predicate=predicate,
    312     name=name,
    313     description=description,
    314     configuration=configuration,
    315     storage_options=storage_options,
    316     writer_properties=(
    317         writer_properties._to_dict() if writer_properties else None
    318     ),
    319     custom_metadata=custom_metadata,
    320 )
    321 if table:
    322     table.update_incremental()

DeltaError: Generic DeltaTable error: Schema error: No field named "Product"."Id". Valid fields are "88e03a2f-8d4f-407c-98de-cb67462708d2"."Product.Id".

It seems that the predicate handling splits the column name on the dot, and the SQL backend (DataFusion, I suppose) then interprets the first part as a table name.
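
To illustrate the suspected behavior, here is a minimal sketch of the difference between splitting an identifier on every dot and splitting only on dots outside double quotes. Both functions are hypothetical illustrations written for this report; they are not the delta-rs or DataFusion implementation, and the sketch ignores escaped quotes ("").

```python
from typing import List


def naive_split(identifier: str) -> List[str]:
    """Drop the quotes, then split on every dot (the suspected faulty behavior)."""
    return identifier.replace('"', "").split(".")


def quote_aware_split(identifier: str) -> List[str]:
    """Split on dots only when they appear outside double quotes."""
    parts: List[str] = []
    current: List[str] = []
    in_quotes = False
    for ch in identifier:
        if ch == '"':
            # Toggle quoted state; the quote character itself is not kept.
            in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            # An unquoted dot separates qualifier parts (e.g. table.column).
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(naive_split('"Product.Id"'))        # ['Product', 'Id']
print(quote_aware_split('"Product.Id"'))  # ['Product.Id']
```

With the naive split, "Product" ends up looking like a table qualifier and "Id" like the column, which matches the error message above.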

emanueledomingo commented 2 days ago

I ran some further tests and got: