Closed thomasfrederikhoeck closed 3 months ago
It appears that the Z is missing in the parsing:
https://github.com/delta-io/delta-rs/blob/6f81b8034dbef7a0120e354e178f0c859564465e/crates/core/src/protocol/checkpoints.rs#L444
@thomasfrederikhoeck it shouldn't write the partition values with a Z. Also my PR didn't touch the partition value serialization.
@thomasfrederikhoeck This issue seems to be related to how the pyarrow engine serializes the partition values.
Yes, it appears that pyarrow serializes timestamp with Z and timestampNtz without.
from datetime import datetime

import pandas as pd
import pyarrow as pa
import pytz
from deltalake import DeltaTable, write_deltalake

tz = "UTC"

def get_data(with_tz):
    # Build daily timestamps, timezone-aware or naive depending on with_tz
    tzinfo = pytz.timezone(tz) if with_tz else None
    dates = pd.date_range(
        datetime(2021, 1, 1, 3, 4, 6, 3, tzinfo=tzinfo),
        datetime(2021, 1, 3, 3, 4, 6, tzinfo=tzinfo),
    )
    return pd.DataFrame({"time": dates, "a": list(range(len(dates)))})

# Table partitioned by a timestamp without timezone (timestampNtz)
schema = pa.schema(
    [
        ("time", pa.timestamp("us")),
        ("a", pa.int64()),
    ]
)
dt = DeltaTable.create(
    "mytable_timestampNtz", schema=schema, partition_by=["time"]
)
write_deltalake("mytable_timestampNtz", get_data(with_tz=False), partition_by="time", mode="append")
print(dt.schema())

# Table partitioned by a timestamp with timezone (timestamp)
schema = pa.schema(
    [
        ("time", pa.timestamp("us", tz)),
        ("a", pa.int64()),
    ]
)
dt = DeltaTable.create(
    "mytable_timestamp", schema=schema, partition_by=["time"]
)
write_deltalake("mytable_timestamp", get_data(with_tz=True), partition_by="time", mode="append")
print(dt.schema())
>Schema([Field(time, PrimitiveType("timestampNtz"), nullable=True), Field(a, PrimitiveType("long"), nullable=True)])
>Schema([Field(time, PrimitiveType("timestamp"), nullable=True), Field(a, PrimitiveType("long"), nullable=True)])
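To illustrate why a parser that expects only one of the two forms would fail, here is a minimal standard-library sketch of a reader that tolerates both. The helper name `parse_partition_timestamp` is hypothetical, and the exact string layout (in particular the `Z`-suffixed form) is an assumption based on the description above, not a dump of what pyarrow or delta-rs actually writes.

```python
from datetime import datetime, timezone


def parse_partition_timestamp(value: str) -> datetime:
    """Hypothetical helper, not part of delta-rs or pyarrow.

    Accepts 'YYYY-MM-DD HH:MM:SS[.ffffff]' with an optional trailing 'Z'.
    A trailing 'Z' is treated as UTC; without it the result is naive.
    """
    is_utc = value.endswith("Z")
    if is_utc:
        value = value[:-1]  # strip the suffix before parsing
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in value else "%Y-%m-%d %H:%M:%S"
    parsed = datetime.strptime(value, fmt)
    return parsed.replace(tzinfo=timezone.utc) if is_utc else parsed


# Assumed example values: the same instant, written with and without 'Z'
with_z = parse_partition_timestamp("2021-01-01 03:04:06.000003Z")
without_z = parse_partition_timestamp("2021-01-01 03:04:06.000003")
```

A strict parser that only knows the format without the suffix would raise a ValueError on the first value, which may be the kind of failure seen during checkpointing.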
@ion-elgreco I wanted to try the rust engine, but the problem is that it serializes the partition value like this, which is invalid on Windows, where you can't have a colon (:) in a folder or file name: 2021-01-01 03:04:06.000003
OSError: Generic LocalFileSystem error: Unable to open file C:\projects\delta-rs\mytable_timestamp\time(=2021-01-02 03:04:06.000003\part-00001-a361470e-2514-4309-ae6f-153e877e3f51-c000.snappy.parquet#1: The filename, directory name, or volume label syntax is incorrect. (os error 123)
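One common way to avoid this class of error is to percent-encode the partition value before using it as a directory name. The sketch below uses only the Python standard library; `encode_partition_value` is a hypothetical helper, and this is not a statement about how delta-rs does or should encode its paths.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Hypothetical helper: percent-encode every reserved character,
    including ':' and spaces, so the value is safe as a folder name."""
    return quote(value, safe="")


raw = "2021-01-01 03:04:06.000003"
encoded = encode_partition_value(raw)
folder = f"time={encoded}"  # hive-style partition directory name

print(folder)
print(unquote(encoded) == raw)  # the original value can be recovered
```

Because the encoding is reversible with `unquote`, a reader can still recover the original partition value from the path.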
@thomasfrederikhoeck can you make a separate issue for that?
Yes, https://github.com/delta-io/delta-rs/issues/2382 :-) @ion-elgreco
@thomasfrederikhoeck for this one, can you also create a separate issue? : P
@ion-elgreco Done https://github.com/delta-io/delta-rs/issues/2384 :-)
Environment
Delta-rs version: Main including https://github.com/delta-io/delta-rs/commit/6f81b8034dbef7a0120e354e178f0c859564465e
Binding: python
Environment:
Bug
What happened: When I try to create a checkpoint on a table partitioned by timestamp, I'm hit with a ValueError. Note that I have built from master, including https://github.com/delta-io/delta-rs/pull/2357:
which gives:
What you expected to happen: That the checkpoint was created.
How to reproduce it: Run the code above.
More details: