Open simonvandel opened 1 year ago
Based on the thumbs up from @wjones127, I'm guessing he understands better what you're asking for.
What kind of interface are you looking for here? Where would we ideally pull this through in a public interface?
Good question. Best case, the sort metadata would be in the Delta metadata, but I don't think such a thing exists. Correct me if I'm wrong.
Perhaps an option would be to add a datafusion_table_provider_with_config method or similar to the DeltaLake struct. It would return a new struct that implements TableProvider, threading through the sort columns.
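To make the shape of that proposal concrete, here is a rough sketch using only the standard library. Everything in it is hypothetical: `ToyTable` stands in for the table struct, `SortedProvider` for the struct that would implement TableProvider, and `ProviderConfig` / `SortColumn` are made-up names. It does not use the real delta-rs or DataFusion types; it only shows the user-declared sort columns being validated and carried along to the provider.

```rust
/// Hypothetical: one column of a user-declared sort order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortColumn {
    pub name: String,
    pub descending: bool,
}

/// Hypothetical config handed to the provider constructor.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct ProviderConfig {
    /// Order the caller guarantees every data file is sorted in.
    pub file_sort_order: Vec<SortColumn>,
}

/// Stand-in for the table struct (not the real delta-rs type).
#[derive(Debug, Clone)]
pub struct ToyTable {
    pub columns: Vec<String>,
}

/// Stand-in for the new struct that would implement TableProvider.
#[derive(Debug)]
pub struct SortedProvider<'a> {
    table: &'a ToyTable,
    config: ProviderConfig,
}

impl ToyTable {
    /// Name taken from the proposal above; the signature is an assumption.
    /// Rejects sort columns that are not part of the table schema.
    pub fn datafusion_table_provider_with_config(
        &self,
        config: ProviderConfig,
    ) -> Result<SortedProvider<'_>, String> {
        for col in &config.file_sort_order {
            if !self.columns.contains(&col.name) {
                return Err(format!("unknown sort column: {}", col.name));
            }
        }
        Ok(SortedProvider { table: self, config })
    }
}

impl SortedProvider<'_> {
    /// Column names of the underlying table.
    pub fn schema(&self) -> &[String] {
        &self.table.columns
    }

    /// The ordering a scan built from this provider would advertise.
    pub fn output_ordering(&self) -> &[SortColumn] {
        &self.config.file_sort_order
    }
}
```

The point of returning a separate struct is that the existing provider keeps its current behaviour, and only callers who opt in take on the responsibility of guaranteeing the sort order.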
I would like to be able to set the output_ordering on the Parquet file opener in DataFusion: https://github.com/delta-io/delta-rs/blob/a74589be7c39315360925049c716d1d70b906970/rust/src/delta_datafusion.rs#L417
If I can guarantee that a Parquet file is sorted in a specific order, the DataFusion optimizer should be able to remove SortExec operations at query time.
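As a toy illustration of why a declared ordering lets a sort be dropped, the sketch below models a plan as either a scan with an advertised ordering or a sort on top of an input. It removes the sort whenever the required keys are a prefix of what the input already guarantees. The `Plan`, `SortKey`, `ordering_satisfies`, and `remove_redundant_sort` names are made up for this example. This is not DataFusion's optimizer, whose rules are richer, so treat it as a simplified model of the idea only.

```rust
/// Hypothetical sort key: a column name plus a direction.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
}

/// True when `required` is a prefix of `declared`, i.e. data already
/// ordered by `declared` is also ordered by `required`.
pub fn ordering_satisfies(declared: &[SortKey], required: &[SortKey]) -> bool {
    required.len() <= declared.len()
        && declared.iter().zip(required).all(|(d, r)| d == r)
}

/// Toy plan: a scan that advertises an ordering, or a sort over an input.
#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    Scan { output_ordering: Vec<SortKey> },
    Sort { keys: Vec<SortKey>, input: Box<Plan> },
}

/// Drops every sort whose input already provides the requested ordering.
pub fn remove_redundant_sort(plan: Plan) -> Plan {
    match plan {
        Plan::Sort { keys, input } => {
            // Simplify the input first so nested sorts are handled too.
            let input = remove_redundant_sort(*input);
            let redundant = match &input {
                Plan::Scan { output_ordering } => ordering_satisfies(output_ordering, &keys),
                Plan::Sort { keys: inner, .. } => ordering_satisfies(inner, &keys),
            };
            if redundant {
                input
            } else {
                Plan::Sort { keys, input: Box::new(input) }
            }
        }
        scan => scan,
    }
}
```

With a scan that advertises no ordering, the sort stays in the plan, which is the situation described above: without a way to set the ordering on the scan, the optimizer has nothing to work with.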