delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.97k stars, 365 forks

feat(rust,python): cast each parquet file to delta schema #2615

Open HawaiianSpork opened 1 week ago

HawaiianSpork commented 1 week ago

Description

By casting each record batch read from a Parquet file to the Delta schema, DataFusion can read tables whose underlying Parquet files do not match the desired schema exactly but can be cast to it. Fixes:

This is now possible because DataFusion exposes a SchemaAdapter that can be overridden.
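To make the idea concrete, here is a minimal sketch of schema adaptation on read. It is a toy model in plain Python, not the DataFusion SchemaAdapter API: the names `TABLE_SCHEMA` and `adapt_batch` are hypothetical, a "batch" is just a dict of column lists, and the choices to fill missing columns with nulls and to drop extra columns are assumptions made for illustration.

```python
# Toy model of schema adaptation on read; NOT the DataFusion SchemaAdapter API.
# A "batch" here is a plain dict mapping column names to lists of values.

# Hypothetical table schema: column name -> function that casts one value.
TABLE_SCHEMA = {
    "id": int,
    "value": float,
    "label": str,
}


def adapt_batch(batch, table_schema):
    """Return a new batch whose columns follow table_schema exactly."""
    num_rows = len(next(iter(batch.values()))) if batch else 0
    adapted = {}
    for name, cast in table_schema.items():
        if name in batch:
            # Cast every non-null value to the table's type.
            adapted[name] = [None if v is None else cast(v) for v in batch[name]]
        else:
            # Assumption: a column missing from this file is filled with nulls.
            adapted[name] = [None] * num_rows
    # Columns present in the file but absent from the schema are dropped.
    return adapted


# A file written with a different schema: "id" stored as strings,
# "value" stored as ints, and no "label" column at all.
file_batch = {"id": ["1", "2"], "value": [10, None]}
adapted = adapt_batch(file_batch, TABLE_SCHEMA)
print(adapted)
```

Every file goes through the same adaptation step, so the query engine only ever sees batches in the table's schema, whatever each individual file contains.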

Note that with this change, all timestamps read by delta-rs have microsecond precision, to match the Delta protocol.
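As a rough illustration of what the precision change means for a single value, the sketch below converts a nanosecond epoch timestamp to microseconds. The helper `nanos_to_micros` and the sample value are hypothetical, and the use of floor division is an assumption; this thread does not state how the real cast handles the sub-microsecond digits.

```python
# Illustration only: timestamps modeled as integer offsets from the Unix epoch.
NANOS_PER_MICRO = 1_000


def nanos_to_micros(ts_nanos):
    """Convert a nanosecond timestamp to microseconds, dropping the remainder."""
    # Assumption: floor division; the actual cast may round or fail differently.
    return ts_nanos // NANOS_PER_MICRO


ts_nanos = 1_700_000_000_123_456_789  # hypothetical value stored in a Parquet file
ts_micros = nanos_to_micros(ts_nanos)
print(ts_micros)  # 1700000000123456

# The last three digits (789 ns) cannot be recovered after the conversion.
lost_nanos = ts_nanos - ts_micros * NANOS_PER_MICRO
print(lost_nanos)  # 789
```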

Related Issue(s)

github-actions[bot] commented 1 week ago

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

ion-elgreco commented 1 week ago

@HawaiianSpork can you add a test where we have a delta table that contains parquets with nanosecond timestamps in the files. Maybe just create a parquet table and then use convert to delta?

HawaiianSpork commented 5 days ago

> @HawaiianSpork can you add a test where we have a delta table that contains parquets with nanosecond timestamps in the files. Maybe just create a parquet table and then use convert to delta?

@ion-elgreco, I'd be happy to add more tests but want to make sure I create the correct ones. The ../test/tests/data/table_with_edge_timestamps data has Parquet files with nanosecond timestamp precision; you can see how this change leads to DataFusion only seeing microsecond precision. Do you think this is fine?

FYI @wjones127 and @roeap, since you made the original commit that read the schema from the Parquet files: #1266.

rtyler commented 3 days ago

This looks promising, but if you don't mind, I would like to update the title for the sake of the future changelog. Schema evolution is typically understood in the Delta context as changes to the Delta schema (i.e. a transaction commit occurs).

If I am understanding this correctly, it's more about schema adaptation on read results.