Open andygrove opened 3 years ago
Seems like something that would be a good fit for a design doc or the user guide.
Hi there - I can work on this. Just to make sure I understand - would doing this at the scan level mean extracting the min-max from the compressed data in order to determine whether the row group even needs to be materialized?
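To make the question concrete, here is a minimal sketch of the min/max idea in plain Rust. It is not DataFusion's implementation: `RowGroupStats`, `Predicate`, `can_skip`, and `row_groups_to_scan` are hypothetical names made up for illustration, and it handles only a single integer column. Note that Parquet keeps per-column min/max statistics in the file footer metadata, so they can be read without decompressing the data pages.

```rust
/// Hypothetical min/max statistics for one integer column of one row group.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Hypothetical simple predicate on that column.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true when the statistics prove that no row in the group can match.
fn can_skip(stats: RowGroupStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Eq(v) => v < stats.min || v > stats.max,
        Predicate::Gt(v) => stats.max <= v,
        Predicate::Lt(v) => stats.min >= v,
    }
}

/// Indices of the row groups that still have to be read and decoded.
fn row_groups_to_scan(groups: &[RowGroupStats], pred: Predicate) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| !can_skip(**stats, pred))
        .map(|(idx, _)| idx)
        .collect()
}

fn main() {
    let groups = [
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
        RowGroupStats { min: 200, max: 299 },
    ];
    // WHERE x > 150: the first row group cannot contain a match.
    println!("{:?}", row_groups_to_scan(&groups, Predicate::Gt(150)));
    // WHERE x = 250 and WHERE x < 50 each leave a single row group.
    println!("{:?}", row_groups_to_scan(&groups, Predicate::Eq(250)));
    println!("{:?}", row_groups_to_scan(&groups, Predicate::Lt(50)));
}
```

The check is conservative: a row group is skipped only when the statistics rule out every row, so a kept row group may still contain no matching rows and the filter has to be applied to it as usual.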
With regard to the actual docs - does it make sense to add a general section on the main docs page to list the optimizations that are currently implemented / planned? For example, what's here https://docs.rs/datafusion/5.0.0/datafusion/optimizer/optimizer/trait.OptimizerRule.html could be used for what's implemented.
FWIW this may overlap with #3464
As of https://github.com/apache/arrow-datafusion/pull/4427, the existence of row group pruning will be visible in the config settings.
Also, DataFusion can now push predicates down into the scan!
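For readers new to the idea, a rough sketch of what pushing a predicate into the scan buys: the filter column is evaluated first, and the remaining columns are materialized only for the rows that passed. This is a simplified illustration in plain Rust, not DataFusion's code; `RowGroup` and `scan_with_pushdown` are hypothetical names, and an in-memory struct stands in for a Parquet row group.

```rust
/// Hypothetical in-memory stand-in for one row group with two columns.
struct RowGroup {
    id: Vec<i64>,
    payload: Vec<String>,
}

/// Evaluates the predicate on the `id` column only, then materializes
/// `payload` just for the rows that passed.
fn scan_with_pushdown<F>(group: &RowGroup, pred: F) -> Vec<(i64, String)>
where
    F: Fn(i64) -> bool,
{
    // Step 1: build a row selection from the filter column alone.
    let selection: Vec<usize> = group
        .id
        .iter()
        .enumerate()
        .filter(|(_, value)| pred(**value))
        .map(|(row, _)| row)
        .collect();

    // Step 2: touch the other column only for the selected rows.
    selection
        .into_iter()
        .map(|row| (group.id[row], group.payload[row].clone()))
        .collect()
}

fn main() {
    let group = RowGroup {
        id: vec![1, 5, 9],
        payload: vec!["a".to_string(), "b".to_string(), "c".to_string()],
    };
    // WHERE id > 4
    let rows = scan_with_pushdown(&group, |id| id > 4);
    println!("{:?}", rows);
}
```

Without pushdown, the scan would hand every row to a separate filter step, so the `payload` values of rejected rows would be decoded and then thrown away.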
The features are now documented pretty well here: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html#features
Maybe we can just add a link to it from the main page / docs site.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We sometimes get questions about support for skipping Parquet row groups based on statistics. It seems that we do not have good documentation around this really cool feature, so we should write something up. We can base it on this response copied from the slack channel.
Describe the solution you'd like
Promote this cool feature in the documentation somewhere (user guide? README?)
Describe alternatives you've considered
None
Additional context
None