delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Unable to write new partitions with type timestamp on tables created with delta-rs 0.10.0 #2631

Open emanueledomingo opened 5 days ago

emanueledomingo commented 5 days ago

Environment

Delta-rs version:

Binding: 0.18.0

Environment:


Bug

What happened:

I have a table written with delta-rs 0.10.0. The schema is:

{
   "type":"struct",
   "fields":[
      {
         "name":"Date",
         "type":"date",
         "nullable":false,
         "metadata":{}
      },
      {
         "name":"Timestamp",
         "type":"timestamp",
         "nullable":false,
         "metadata":{}
      }
   ]
}
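
As a minimal sketch (not part of the original report), the schema JSON above can be checked for which timestamp primitive each column uses. In the Delta protocol, `timestamp` is the timezone-adjusted type and `timestamp_ntz` is the type without a timezone. The helper name `timestamp_fields` is hypothetical, and the JSON string is assumed to come from something like `DeltaTable("tmp/table").schema().to_json()`.

```py
import json

# Schema JSON copied from the table created with delta-rs 0.10.0.
SCHEMA_JSON = """
{
   "type": "struct",
   "fields": [
      {"name": "Date", "type": "date", "nullable": false, "metadata": {}},
      {"name": "Timestamp", "type": "timestamp", "nullable": false, "metadata": {}}
   ]
}
"""


def timestamp_fields(schema_json: str) -> dict:
    """Hypothetical helper: map each top-level timestamp column to its Delta primitive type."""
    schema = json.loads(schema_json)
    return {
        field["name"]: field["type"]
        for field in schema["fields"]
        # Nested types are dicts, so they never match these strings.
        if field["type"] in ("timestamp", "timestamp_ntz")
    }


print(timestamp_fields(SCHEMA_JSON))  # {'Timestamp': 'timestamp'}
```

A column reported as `timestamp` here is the legacy case described in this issue; tables created by newer clients from timezone-less data would report `timestamp_ntz` instead.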

I'm trying to write a new partition to that table with the following schema:

pa.schema(
    [
        ("Day", pa.date32(), False),
        ("Timestamp", pa.timestamp("us"), False),
    ]
)

But I get: DeltaError: Generic DeltaTable error: Writer features must be specified for writerversion >= 7, please specify: TimestampWithoutTimezone.

This worked fine with deltalake 0.16.2. After bumping to 0.18.0, I get this error on tables created with an older delta-rs client.

If the table is created with a newer delta-rs client, this doesn't happen.

How to reproduce it:

1. Create a table with `deltalake==0.10.0`

```py
import deltalake as dl
import pyarrow as pa

dl.__version__   # 0.10.0

# Both columns need the same number of rows.
ta = pa.Table.from_pydict(
    {
        "Date": ["2023-01-01", "2023-01-02"],
        "Timestamp": ["2023-01-01T14:37:35.386235", "2023-01-01T14:37:35.386235"]
    }
)

ta = ta.cast(
    pa.schema(
        [
            ("Date", pa.date32(), False),
            ("Timestamp", pa.timestamp("us"), False),
        ]
    )
)
dl.write_deltalake(
    table_or_uri="tmp/table",
    mode="overwrite",
    data=ta,
)
```

2. Write a new partition with `deltalake==0.18.0`

```py
import deltalake as dl
import pyarrow as pa

dl.__version__   # 0.18.0

ta = pa.Table.from_pydict(
    {
        "Date": ["2024-06-28"],
        "Timestamp": ["2024-06-28T14:37:35.386235"]
    }
)

ta = ta.cast(
    pa.schema(
        [
            ("Date", pa.date32(), False),
            ("Timestamp", pa.timestamp("us"), False),
        ]
    )
)
dl.write_deltalake(
    table_or_uri="tmp/table",
    mode="overwrite",
    data=ta,
    partition_filters=[("Date", "=", "2024-06-28")],
)
```

More details:

  1. Debugging the code (at least from Python), I noticed that the table created with delta-rs 0.10 has "timestamp" as a primitive type, while new tables now have "timestamp_ntz".
  2. If I add a timezone (for example UTC), the write is successful, even if the table was created with delta-rs 0.10.

Josh-Hiz commented 2 days ago

I am also seeing the same issue when writing new partitions with type timestamp in 0.18.1, when timestamps are formatted like '2015-10-30T06:40:15.000Z' (year-month-dayThour:minute:second.000Z).

ion-elgreco commented 2 days ago

We fixed a long-standing bug where timestamps were incorrect. This has now been corrected, and it was a backwards-incompatible change in some areas. Additionally, the pyarrow engine incorrectly writes UTC timestamps as Z; this is something we cannot configure in pyarrow.

emanueledomingo commented 2 days ago

Is there a way to migrate the schema from "timestamp" to "timestamp_ntz" without recreating the table (and reloading all the historical data)?

I tried with schema_mode: overwrite but I get the same error. It seems that delta-rs is unable to write the new "timestamp_ntz" type over the legacy "timestamp".

ion-elgreco commented 2 days ago

@emanueledomingo the easiest option at the moment is to recreate the table