apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Ballista should serialize Parquet statistics #14

Open andygrove opened 2 years ago

andygrove commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When the Ballista scheduler or executor deserializes a ParquetExec, it collects the Parquet statistics again, which is redundant. We should serialize the statistics along with the plan to avoid this extra work.

Describe the solution you'd like

Add Parquet statistics to the serde module.

Describe alternatives you've considered

N/A

Additional context

N/A
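
To make the request concrete, here is a minimal sketch of carrying statistics through a serialize/deserialize round trip so the receiver does not have to re-read Parquet footers. Everything in it is an assumption for illustration: `ScanStatistics` and `ParquetScanNode` are simplified, hypothetical stand-ins rather than the real DataFusion or Ballista types, and the hand-rolled byte layout stands in for whatever encoding the serde module actually uses.

```rust
// Hypothetical, simplified stand-in for a statistics type.
// `None` means "unknown", so missing values survive the round trip.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ScanStatistics {
    pub num_rows: Option<u64>,
    pub total_byte_size: Option<u64>,
}

// Hypothetical plan node: a Parquet scan plus the statistics collected once.
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetScanNode {
    pub path: String,
    pub statistics: ScanStatistics,
}

// Write an optional u64 as a one-byte tag followed by 8 little-endian bytes.
fn put_opt_u64(buf: &mut Vec<u8>, value: Option<u64>) {
    match value {
        Some(v) => {
            buf.push(1);
            buf.extend_from_slice(&v.to_le_bytes());
        }
        None => buf.push(0),
    }
}

// Read an optional u64; the outer `None` signals malformed or truncated input.
fn take_opt_u64(buf: &[u8], pos: &mut usize) -> Option<Option<u64>> {
    let tag = *buf.get(*pos)?;
    *pos += 1;
    match tag {
        0 => Some(None),
        1 => {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(buf.get(*pos..*pos + 8)?);
            *pos += 8;
            Some(Some(u64::from_le_bytes(bytes)))
        }
        _ => None,
    }
}

// Serialize the node together with its statistics.
pub fn encode(node: &ParquetScanNode) -> Vec<u8> {
    let mut buf = Vec::new();
    let path = node.path.as_bytes();
    buf.extend_from_slice(&(path.len() as u32).to_le_bytes());
    buf.extend_from_slice(path);
    put_opt_u64(&mut buf, node.statistics.num_rows);
    put_opt_u64(&mut buf, node.statistics.total_byte_size);
    buf
}

// Rebuild the node from bytes without touching the Parquet files again.
pub fn decode(buf: &[u8]) -> Option<ParquetScanNode> {
    let mut len = [0u8; 4];
    len.copy_from_slice(buf.get(0..4)?);
    let mut pos = 4 + u32::from_le_bytes(len) as usize;
    let path = String::from_utf8(buf.get(4..pos)?.to_vec()).ok()?;
    let num_rows = take_opt_u64(buf, &mut pos)?;
    let total_byte_size = take_opt_u64(buf, &mut pos)?;
    Some(ParquetScanNode {
        path,
        statistics: ScanStatistics { num_rows, total_byte_size },
    })
}
```

With something along these lines, `decode(&encode(&node))` yields a node whose statistics equal the original ones, which is the property this issue asks for.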

rdettai commented 2 years ago

In apache/arrow-datafusion#962 I am considering making the statistics part of the ExecutionPlan trait (and removing them from TableProvider). But I think that not all nodes will have a cached version of the statistics, only those nodes for which fetching them is expensive and that know that they will not change.

We will probably not need the statistics on the executor, because I doubt that any re-optimization will take place there. So it might be an optimization further down the road to optionally leave them out of the serialization in that case.
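
A rough sketch of what "optionally leaving them out" could look like, assuming the serialized form holds the statistics as an optional field. The names `Target`, `SerializedScan`, and `to_wire` are hypothetical and exist only for this illustration; they are not part of Ballista or DataFusion.

```rust
// Hypothetical, simplified statistics type.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanStatistics {
    pub num_rows: Option<u64>,
    pub total_byte_size: Option<u64>,
}

// Hypothetical: who will receive the serialized plan.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Target {
    Scheduler,
    Executor,
}

// Hypothetical wire form of a scan node; statistics are optional because
// not every node caches them and not every receiver needs them.
#[derive(Debug, Clone, PartialEq)]
pub struct SerializedScan {
    pub path: String,
    pub statistics: Option<ScanStatistics>,
}

// Include cached statistics only when the receiver may re-optimize the plan.
pub fn to_wire(
    path: &str,
    cached: Option<&ScanStatistics>,
    target: Target,
) -> SerializedScan {
    let statistics = match target {
        // The scheduler may still plan with them, so pass along what is cached.
        Target::Scheduler => cached.cloned(),
        // Assumption from the comment above: no re-optimization on executors.
        Target::Executor => None,
    };
    SerializedScan {
        path: path.to_string(),
        statistics,
    }
}
```

Making the field optional covers both cases discussed here: nodes that never cached statistics send nothing, and plans bound for an executor can skip the payload even when a cached copy exists.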