apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Adapt column statistics API #717

Open Dandandan opened 2 years ago

Dandandan commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

While looking at adding support for more statistics in the Delta Lake TableProvider implementation, I bumped into a limitation in our statistics API.

Currently column_statistics is an Option<Vec<ColumnStatistics>>.

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/datasource.rs#L37

So, a provider has to return the statistics at the correct index for each column, regardless of the column order in the files.
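To illustrate the index requirement, here is a minimal sketch of reordering per-file statistics into schema order. The `ColumnStatistics` struct below is a simplified stand-in, not DataFusion's actual definition, and `align_to_schema` is a hypothetical helper that assumes the file statistics are keyed by column name.

```rust
use std::collections::HashMap;

// Simplified stand-in for illustration; the real ColumnStatistics has more fields.
#[derive(Debug, Clone, PartialEq, Default)]
struct ColumnStatistics {
    null_count: Option<usize>,
}

/// Hypothetical helper: returns one statistics entry per schema column,
/// in schema order, whatever order the file stored them in.
/// Columns without statistics in the file get a default (all-unknown) entry.
fn align_to_schema(
    schema_columns: &[&str],
    file_stats: &HashMap<String, ColumnStatistics>,
) -> Vec<ColumnStatistics> {
    schema_columns
        .iter()
        .map(|name| file_stats.get(*name).cloned().unwrap_or_default())
        .collect()
}
```

With this shape, the position of an entry in the returned vector always matches the position of the column in the schema, and a column that is missing from the file still occupies its slot.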

Describe the solution you'd like

Either:

FWIW, Delta Lake / delta-rs takes the first approach, which seems straightforward to implement and use.

Describe alternatives you've considered

Additional context

Dandandan commented 2 years ago

Closing, since this can be done with the schema on the table provider instead.

rdettai commented 2 years ago

@Dandandan in #965 I used the schema from the ExecutionPlan trait and it worked fine. But I do agree that it might be better to come up with a data structure that helps assert that the column_statistics vector is well aligned with the schema fields vector (same size, same types...). I'm adding this as an item in #997, so if you want to close this for now, that's fine by me 😃
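One possible shape for such a data structure, as a rough sketch only: `Field`, `ColumnStatistics`, and `AlignedStatistics` below are simplified, hypothetical types rather than the Arrow or DataFusion ones, and only the size check is shown; a type check would additionally need typed min/max values.

```rust
// Simplified stand-ins for illustration; not the Arrow/DataFusion types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

#[derive(Debug, Clone, PartialEq, Default)]
struct ColumnStatistics {
    null_count: Option<usize>,
}

/// Hypothetical wrapper that pairs every schema field with its statistics,
/// so a misaligned vector is rejected at construction time.
#[derive(Debug)]
struct AlignedStatistics {
    columns: Vec<(Field, ColumnStatistics)>,
}

impl AlignedStatistics {
    /// Fails if the number of statistics entries differs from the number of fields.
    fn try_new(fields: Vec<Field>, stats: Vec<ColumnStatistics>) -> Result<Self, String> {
        if fields.len() != stats.len() {
            return Err(format!(
                "expected {} column statistics, got {}",
                fields.len(),
                stats.len()
            ));
        }
        Ok(Self {
            columns: fields.into_iter().zip(stats).collect(),
        })
    }

    /// Looks up the statistics of a column by name.
    fn for_column(&self, name: &str) -> Option<&ColumnStatistics> {
        self.columns
            .iter()
            .find(|(field, _)| field.name == name)
            .map(|(_, stats)| stats)
    }
}
```

Keeping each field next to its statistics means callers never index two parallel vectors, which is where the alignment mistakes would otherwise creep in.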