delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

ValueError: Partition value cannot be parsed from string. #2380

Closed thomasfrederikhoeck closed 3 months ago

thomasfrederikhoeck commented 3 months ago

Environment

Delta-rs version: Main including https://github.com/delta-io/delta-rs/commit/6f81b8034dbef7a0120e354e178f0c859564465e
Binding: python

Environment:


Bug

What happened: When I try to create a checkpoint on a table partitioned by timestamp, I'm hit with a ValueError. Note that I have built from master, including https://github.com/delta-io/delta-rs/pull/2357:

import pandas as pd
from datetime import datetime
import deltalake as dl
from deltalake import DeltaTable, write_deltalake

# Daily timestamps with a non-zero microsecond component
dates = pd.date_range(datetime(2021,1,1,3,4,6,3),datetime(2021,1,3,3,4,6))

df = pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})

schema = dl.schema.Schema(fields=[
    dl.schema.Field("time",dl._internal.PrimitiveType.from_json('"timestamp"')),
    dl.schema.Field("a",dl._internal.PrimitiveType.from_json('"integer"'))
    ]
)

# Partition by the timestamp column, then try to checkpoint the table
write_deltalake("mytable",df, schema=schema,partition_by="time")
dt = DeltaTable("mytable")
dt.create_checkpoint()

which gives:

ValueError: Partition value 2021-01-02 03:04:06.000003Z cannot be parsed from string.

What you expected to happen: That the checkpoint was created.

How to reproduce it: Run code above

More details:

thomasfrederikhoeck commented 3 months ago

It appears that the trailing Z is not accounted for in the parsing: https://github.com/delta-io/delta-rs/blob/6f81b8034dbef7a0120e354e178f0c859564465e/crates/core/src/protocol/checkpoints.rs#L444
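
To show the failure mode in isolation, here is a minimal Python sketch (the real parser lives in the Rust code linked above; this is only an analogy). The format string `NAIVE_FORMAT` and the helper `parse_partition_timestamp` are hypothetical names chosen for illustration. They are not part of delta-rs.

```python
from datetime import datetime, timezone

# Assumption: a format with no timezone designator, similar in spirit to a
# parser that only expects "YYYY-MM-DD HH:MM:SS.ffffff".
NAIVE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_partition_timestamp(value: str) -> datetime:
    """Hypothetical tolerant parser: accept the value with or without a trailing Z."""
    if value.endswith("Z"):
        # Strip the UTC designator and mark the result as timezone-aware.
        naive = datetime.strptime(value[:-1], NAIVE_FORMAT)
        return naive.replace(tzinfo=timezone.utc)
    return datetime.strptime(value, NAIVE_FORMAT)


if __name__ == "__main__":
    raw = "2021-01-02 03:04:06.000003Z"
    try:
        # The strict format rejects the trailing Z.
        datetime.strptime(raw, NAIVE_FORMAT)
    except ValueError as exc:
        print(f"strict parse failed: {exc}")
    print(parse_partition_timestamp(raw))
```

Whether the right fix is to accept the Z when parsing or to stop writing it is a separate question. The sketch only shows why a format without a timezone designator rejects the value from the error message.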

ion-elgreco commented 3 months ago

@thomasfrederikhoeck it shouldn't write the partition values with a Z. Also my PR didn't touch the partition value serialization.

ion-elgreco commented 3 months ago

@thomasfrederikhoeck This issue seems to be related to how pyarrow engine is serializing the partition values

thomasfrederikhoeck commented 3 months ago

Yes, it appears that pyarrow serializes timestamp with a Z and timestampNtz without one:

from datetime import datetime

import pandas as pd
import pyarrow as pa
import pytz
from deltalake import DeltaTable, write_deltalake

tz = "UTC"

def get_data(with_tz):
    # Build daily timestamps, timezone-aware (UTC) or naive
    tzinfo = pytz.timezone(tz) if with_tz else None
    dates = pd.date_range(
        datetime(2021,1,1,3,4,6,3, tzinfo=tzinfo),
        datetime(2021,1,3,3,4,6, tzinfo=tzinfo)
        )
    return pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})

# Case 1: timestamp without timezone -> timestampNtz
schema = pa.schema(
        [
            ("time", pa.timestamp("us")),
            ("a", pa.int64()),
        ]
    )
dt = DeltaTable.create(
        "mytable_timestampNtz", schema=schema, partition_by=["time"]
    )

write_deltalake("mytable_timestampNtz",get_data(with_tz=False), partition_by="time", mode="append")
print(dt.schema())

# Case 2: timestamp with timezone (UTC) -> timestamp
schema = pa.schema(
        [
            ("time", pa.timestamp("us",tz)),
            ("a", pa.int64()),
        ]
    )
dt = DeltaTable.create(
        "mytable_timestamp", schema=schema, partition_by=["time"]
    )

write_deltalake("mytable_timestamp",get_data(with_tz=True), partition_by="time", mode="append")
print(dt.schema())

>Schema([Field(time, PrimitiveType("timestampNtz"), nullable=True), Field(a, PrimitiveType("long"), nullable=True)])
>Schema([Field(time, PrimitiveType("timestamp"), nullable=True), Field(a, PrimitiveType("long"), nullable=True)])


thomasfrederikhoeck commented 3 months ago

@ion-elgreco I wanted to try the rust engine, but the problem is that it serializes the partition value like this, which is invalid on Windows, where you can't have a colon (:) in a folder or file name: 2021-01-01 03:04:06.000003

OSError: Generic LocalFileSystem error: Unable to open file C:\projects\delta-rs\mytable_timestamp\time(=2021-01-02 03:04:06.000003\part-00001-a361470e-2514-4309-ae6f-153e877e3f51-c000.snappy.parquet#1: The filename, directory name, or volume label syntax is incorrect. (os error 123)
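
For illustration, one common way to keep such values out of trouble in directory names is to percent-encode them before building the path. The sketch below uses only the Python standard library. The helper `encode_partition_dir` is a hypothetical name, and this is not a description of what either delta-rs engine does.

```python
from urllib.parse import quote, unquote


def encode_partition_dir(column: str, value: str) -> str:
    """Hypothetical helper: build a `column=value` directory name with the
    value percent-encoded, so characters such as ':' and ' ' never reach
    the filesystem."""
    # safe="" means no reserved characters are left unescaped.
    return f"{column}={quote(value, safe='')}"


if __name__ == "__main__":
    raw = "2021-01-02 03:04:06.000003"
    dirname = encode_partition_dir("time", raw)
    print(dirname)  # time=2021-01-02%2003%3A04%3A06.000003

    # The original value can be recovered when reading the path back.
    column, _, encoded = dirname.partition("=")
    assert unquote(encoded) == raw
```

The encoded name contains no colon, so it is a valid folder name on Windows, and the original value can still be recovered with `unquote` when the path is read back.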

ion-elgreco commented 3 months ago

@thomasfrederikhoeck can you make a separate issue for that?

thomasfrederikhoeck commented 3 months ago

Yes, https://github.com/delta-io/delta-rs/issues/2382 :-) @ion-elgreco

ion-elgreco commented 3 months ago

> Yes, it appears that pyarrow serializes timestamp with a Z and timestampNtz without one.

@thomasfrederikhoeck for this one, can you also create a separate issue? :P

thomasfrederikhoeck commented 3 months ago

@ion-elgreco Done https://github.com/delta-io/delta-rs/issues/2384 :-)