delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

updates of <struct> types not supported/undocumented #1903

Open maciejskorski opened 1 year ago

maciejskorski commented 1 year ago

Environment

Delta-rs version:

0.13.0

Binding:

Python

Environment:


Bug

What happened:

It's unclear how to update columns of struct type, or whether this is supported at all. Judging by the DataFusion engine, I would expect this syntax to work, but I can't get it to work 😞

dt.update(updates={"my_struct":"Struct(1,2)"})

What you expected to happen:

Update the type.

How to reproduce it:

# !pip install deltalake

import pandas as pd
from deltalake import write_deltalake, DeltaTable

df = pd.DataFrame.from_records([
     {"name": "Alice", "age": 25, "gender": "Female", "my_struct": {"a": 1, "b": 2}},
     {"name": "Bob", "age": 30, "gender": "Male", "my_struct": {"a": 3, "b": 4}}
])
write_deltalake('./db/my_table', df)
dt = DeltaTable('./db/my_table')
dt.to_pandas()

dt.update(updates={"my_struct":"Struct(1,2)"},predicate="name='Bob'") # seems compatible but fails
# ValueError: arguments need to have the same data type

More details:

ion-elgreco commented 1 year ago

@maciejskorski what happens if you pass the struct as json string representation?

maciejskorski commented 1 year ago

> @maciejskorski what happens if you pass the struct as json string representation?

@ion-elgreco I tried, but it complains about SQL incompatibility. From what I found (which may not be up to date), the update expression must be SQL in DataFusion's dialect.
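To illustrate the point about SQL expressions: if the values in `updates` are parsed as SQL, as described above for 0.13.0, then plain scalars have to be written as SQL literals, and string values need their own single quotes. The sketch below is only an illustration under that assumption; `sql_literal` and `build_updates` are hypothetical helpers, not part of deltalake, and they cover scalar columns only, not the struct case this issue is about.

```python
def sql_literal(value):
    """Render a Python scalar as a SQL literal string (hypothetical helper)."""
    if value is None:
        return "NULL"
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # SQL string literals use single quotes; embedded quotes are doubled.
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"no SQL literal rendering for {type(value).__name__}")


def build_updates(values):
    """Map column names to SQL literal strings (hypothetical helper)."""
    return {column: sql_literal(value) for column, value in values.items()}


updates = build_updates({"age": 31, "gender": "Male"})
print(updates)  # {'age': '31', 'gender': "'Male'"}
```

With the table from the report, the resulting dictionary could then be passed as `dt.update(updates=updates, predicate="name = 'Bob'")`.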

r3stl355 commented 1 year ago

I've been debugging this but can't figure out what's happening. It looks like struct(1,2) is parsed correctly (it seems to use BigQuery syntax, but that should not matter), and then this mysterious error comes up: ValueError: arguments need to have the same data type. I can't find it in any code base, including Arrow (or even Google 😄).

I tried different options with an explicit schema and nullable settings, with the same result. Syntax like STRUCT<a bigint, b bigint>, STRUCT(1 as a, 2 as b), etc. is not accepted, probably by DataFusion.
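Until struct expressions work in update, one possible workaround is to read the table, replace the struct values in Python, and overwrite the table. This is a sketch, not something proposed in this thread: `replace_struct` and `rewrite_table` are hypothetical helpers, the approach rewrites the whole table rather than only the matching rows, and it assumes the table fits in memory.

```python
def replace_struct(rows, column, new_value, matches):
    """Return a new list of rows with `column` replaced where `matches(row)` is true.

    Rows that do not match are returned unchanged; the input is not mutated.
    """
    return [{**row, column: new_value} if matches(row) else row for row in rows]


def rewrite_table(path):
    """Read the Delta table at `path`, patch Bob's struct, and overwrite the table."""
    # Imported here so replace_struct stays usable without these packages installed.
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    table = DeltaTable(path).to_pyarrow_table()
    rows = replace_struct(
        table.to_pylist(),
        column="my_struct",
        new_value={"a": 1, "b": 2},
        matches=lambda row: row["name"] == "Bob",
    )
    # Reuse the original schema so the struct field types stay the same.
    patched = pa.Table.from_pylist(rows, schema=table.schema)
    write_deltalake(path, patched, mode="overwrite")
```

For the table in the report, this would be called as `rewrite_table('./db/my_table')`.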