apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add documentation for support for skipping Parquet row groups #825

Open andygrove opened 3 years ago

andygrove commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We sometimes get questions about support for skipping Parquet row groups based on statistics. We do not seem to have good documentation for this really cool feature, so we should write something up. We can base it on this response copied from the Slack channel.

DataFusion has support for skipping entire row groups using predicates and min and max statistics.

It does not (yet) push the predicates down into the actual scan (e.g. to avoid materializing data that wouldn’t pass the predicate) — instead any row groups that are not pruned are decompressed into RecordBatches and then filtered.
Also, DataFusion will do “projection pushdown” — aka it will read only those columns needed to answer the query.
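
To make the idea concrete, here is a minimal sketch of min/max row group pruning followed by row filtering. It is not DataFusion's implementation or API: `RowGroup`, `can_skip`, and `scan_between` are hypothetical names invented for illustration, and a plain `Vec<i64>` stands in for the compressed column data.

```rust
/// Hypothetical stand-in for one Parquet row group holding a single integer column.
/// This is an illustration only, not a DataFusion or parquet crate type.
struct RowGroup {
    /// Minimum value of the column, as recorded in the row group statistics.
    min: i64,
    /// Maximum value of the column, as recorded in the row group statistics.
    max: i64,
    /// Stands in for the compressed column data that a real scan would decode.
    values: Vec<i64>,
}

/// Returns true when the statistics alone prove that no value in the row
/// group can satisfy `lo <= x AND x <= hi`.
fn can_skip(rg: &RowGroup, lo: i64, hi: i64) -> bool {
    rg.max < lo || rg.min > hi
}

/// Evaluates `lo <= x AND x <= hi` over all row groups.
/// Step 1: skip whole row groups using only their min/max statistics.
/// Step 2: "decode" each remaining row group and filter its rows.
/// Returns the matching values and the number of row groups actually read.
fn scan_between(groups: &[RowGroup], lo: i64, hi: i64) -> (Vec<i64>, usize) {
    let mut matches = Vec::new();
    let mut groups_read = 0;
    for rg in groups {
        if can_skip(rg, lo, hi) {
            continue; // pruned: the data is never touched
        }
        groups_read += 1;
        matches.extend(rg.values.iter().copied().filter(|v| *v >= lo && *v <= hi));
    }
    (matches, groups_read)
}
```

For example, given three row groups whose values span 1 to 10, 11 to 20, and 21 to 30, calling `scan_between(&groups, 12, 18)` skips the first and last groups from their statistics alone and only filters the rows of the middle one.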

Describe the solution you'd like

Promote this cool feature in the documentation somewhere (user guide? README?)

Describe alternatives you've considered

None

Additional context

None

houqp commented 3 years ago

Seems like something that would be a good fit for a design doc or the user guide.

matthewmturner commented 3 years ago

Hi there - I can work on this. Just to make sure I understand - would doing this at the scan level mean extracting the min-max from the compressed data in order to determine whether the row group even needs to be materialized?

With regard to the actual docs: does it make sense to add a general section on the main docs page listing the optimizations that are currently implemented or planned? For example, what's at https://docs.rs/datafusion/5.0.0/datafusion/optimizer/optimizer/trait.OptimizerRule.html could be used for what's implemented.

tustvold commented 1 year ago

FWIW this may overlap with #3464

alamb commented 1 year ago

As of https://github.com/apache/arrow-datafusion/pull/4427, the existence of row group pruning will be visible in the config settings.

Also, DataFusion can now push predicates down into the scan!

alamb commented 2 weeks ago

The features are now listed pretty well here: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html#features

Maybe we can just add a link to the main page / docs site.