delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

File options are ignored when writing delta #1444

Closed Dammi87 closed 1 year ago

Dammi87 commented 1 year ago

Environment

OS: Windows 10
Python: 3.10.11

Delta-rs version: deltalake 0.9.0
pyarrow 12.0.0
numpy 1.24.3


Bug

What happened: I'm receiving JSON data from a service that uses nanosecond resolution, and I need to store it in Delta format. Truncated timestamps are acceptable, so I intended to allow truncation and coerce the timestamps to microsecond resolution. However, I end up with this error:

PyDeltaTableError: Schema error: Invalid data type for Delta Lake: Timestamp(Nanosecond, Some("UTC"))

What you expected to happen: I expected the timestamp to be truncated and converted to microseconds.

How to reproduce it:

import io
import json

import pyarrow as pa
import pyarrow.json as pj
from deltalake.writer import write_deltalake
from pyarrow.dataset import ParquetFileFormat

def get_obj(content):
    # Serialize the records as newline-delimited JSON in an in-memory buffer
    output = '\n'.join(json.dumps(d) for d in content)
    return io.BytesIO(output.encode())

def arrow(schema, content):
    return pj.read_json(
        get_obj(content),
        parse_options=pj.ParseOptions(
            explicit_schema=schema
        )
    )

content = [
    {'timeStamp': "2022-12-28T00:00:00.3352264Z"},
    {'timeStamp': "2022-12-28T00:00:00.3352264Z"}
]
schema = pa.schema([pa.field('timeStamp', pa.timestamp('ns', tz='UTC'))])
table = arrow(schema, content)

write_options = ParquetFileFormat().make_write_options(use_deprecated_int96_timestamps = False, coerce_timestamps = 'us', allow_truncated_timestamps = True)
write_deltalake('test', table, file_options=write_options)

More details: This is a minimal reproducible example from the pipeline I'm creating, which receives a stream of JSON arrays.

wjones127 commented 1 year ago

Right now we expect users to cast their data types to ones Delta Lake supports. We may support automatic casting in the future. That's tracked by https://github.com/delta-io/delta-rs/issues/686

Dammi87 commented 1 year ago

Gotcha thanks!

I was aware of the limitation but the only unsupported data-type I was encountering was this damn timestamp, so I hoped that the file_options would save me the work :)

Should I close the issue then?

wjones127 commented 1 year ago

Yeah sorry those truncation options don't work for that. I think we'd like to fold this into the general issue for mapping data types though, rather than treat timestamps specially.

Dammi87 commented 1 year ago

No worries, you guys are doing awesome work, much appreciated

neo4py commented 3 months ago

I am getting the same error, but I did not follow what the fix is. Can you please clarify? Thanks!

I start from original_value = "2024-03-11T14:31:32.804589Z" and convert it with datetime.fromisoformat(original_value). I use this as a column in a pandas DataFrame, and when I print the dtype it shows datetime64[ns, UTC].

I then build the pyarrow schema from this pandas DataFrame and pass it to the write_deltalake function. When I print the data type from pyarrow, it shows timestamp[ns, tz=UTC]. I have tried truncating the seconds altogether before creating the pandas DataFrame, but to no avail.