delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.22k stars 394 forks

Allow configuring output ordering of Parquet file in DataFusion integration #1655

Open simonvandel opened 1 year ago

simonvandel commented 1 year ago

I would like to be able to set the output_ordering on the Parquet file opener in DataFusion: https://github.com/delta-io/delta-rs/blob/a74589be7c39315360925049c716d1d70b906970/rust/src/delta_datafusion.rs#L417

If I can guarantee that a Parquet file is sorted in a specific order, the DataFusion optimizer should be able to remove SortExec operations at query time.
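To make the request concrete, here is a minimal sketch of the idea in plain Rust. It is a toy model, not DataFusion's actual types or optimizer: `SortKey`, `Plan`, and `remove_redundant_sorts` are hypothetical names invented for illustration. It only shows why a scan that declares its ordering lets a planner drop a sort on top of it.

```rust
/// A sort key: column name plus direction (toy model, not DataFusion's types).
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
}

impl SortKey {
    /// Convenience constructor for an ascending key.
    pub fn asc(column: &str) -> Self {
        SortKey { column: column.to_string(), descending: false }
    }
}

/// A tiny physical plan with only the two node kinds needed here.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// A file scan that may declare the ordering its rows already have.
    Scan { path: String, output_ordering: Vec<SortKey> },
    /// An explicit sort on top of another plan.
    Sort { keys: Vec<SortKey>, input: Box<Plan> },
}

impl Plan {
    /// The ordering this node guarantees for its output rows.
    pub fn output_ordering(&self) -> &[SortKey] {
        match self {
            Plan::Scan { output_ordering, .. } => output_ordering,
            Plan::Sort { keys, .. } => keys,
        }
    }
}

/// Drop any Sort whose keys are a prefix of the ordering its input
/// already guarantees; keep every other Sort in place.
pub fn remove_redundant_sorts(plan: Plan) -> Plan {
    match plan {
        Plan::Sort { keys, input } => {
            let input = remove_redundant_sorts(*input);
            if input.output_ordering().starts_with(&keys) {
                input
            } else {
                Plan::Sort { keys, input: Box::new(input) }
            }
        }
        scan => scan,
    }
}
```

In this model, a `Sort` on `ts` over a scan that declares `ts` ordering collapses to the bare scan, while the same `Sort` over a scan with an empty `output_ordering` is left untouched. Today the Delta scan is effectively always in the second case.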

rtyler commented 1 year ago

Based on the thumbs up from @wjones127, I'm guessing he understands better what you're asking for.

What kind of interface are you looking for here? Where would we ideally pull this through in a public interface?

simonvandel commented 1 year ago

> What kind of interface are you looking for here? Where would we ideally pull this through in a public interface?

Good question. Best case, the sort metadata would be in the Delta metadata, but I don't think such a thing exists. Correct me if I'm wrong.

Perhaps one option would be to add a datafusion_table_provider_with_config method (or similar) to the DeltaLake struct. It would return a new struct that implements TableProvider and threads the sort columns through to the Parquet scan.
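A rough sketch of the shape I have in mind, using stand-in types so it compiles on its own. Nothing here is existing delta-rs or DataFusion API: the `TableProvider` trait below is a one-method placeholder for the real trait, and `ProviderConfig`, `ConfiguredProvider`, and the `DeltaLake` struct are hypothetical names used only for illustration.

```rust
/// Placeholder for DataFusion's `TableProvider` trait, reduced to the one
/// property this sketch cares about. This is NOT the real trait.
pub trait TableProvider {
    /// Columns the provider promises its files are sorted by.
    fn output_ordering(&self) -> &[String];
}

/// Hypothetical options supplied by the caller.
#[derive(Debug, Clone, Default)]
pub struct ProviderConfig {
    /// Columns the caller guarantees every Parquet file is sorted by.
    pub sort_columns: Vec<String>,
}

/// Stand-in for the table struct mentioned above.
#[derive(Debug, Clone)]
pub struct DeltaLake {
    pub table_uri: String,
}

/// The "new struct": wraps the table together with the caller's config.
#[derive(Debug, Clone)]
pub struct ConfiguredProvider {
    table: DeltaLake,
    config: ProviderConfig,
}

impl DeltaLake {
    /// Hypothetical method: build a provider that carries the sort columns.
    pub fn datafusion_table_provider_with_config(
        &self,
        config: ProviderConfig,
    ) -> ConfiguredProvider {
        ConfiguredProvider { table: self.clone(), config }
    }
}

impl ConfiguredProvider {
    /// The table this provider reads from.
    pub fn table_uri(&self) -> &str {
        &self.table.table_uri
    }
}

impl TableProvider for ConfiguredProvider {
    /// A real implementation would forward this to the Parquet scan config
    /// when building the scan; here it is only exposed.
    fn output_ordering(&self) -> &[String] {
        &self.config.sort_columns
    }
}
```

The existing provider would stay unchanged, so callers who cannot guarantee an ordering are unaffected. Correctness of the declared ordering would be the caller's responsibility, since nothing in the Delta log verifies it.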