apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Extract parquet statistics from `LargeUtf8` columns #10756

Closed alamb closed 3 weeks ago

alamb commented 4 weeks ago

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10453, where we are filling out support for extracting statistics for all data types from parquet files

At the moment, even if statistics are extracted for a different type (like Int32) the PruningPredicate will attempt to cast these values to the correct type:

https://github.com/apache/datafusion/blob/acd7106fa40fad58f50ae06227971c51073d8f48/datafusion/core/src/physical_optimizer/pruning.rs#L909-L911

However, in order to be efficient and ensure the cast kernel doesn't add anything incorrectly, we should be extracting the parquet statistics as the correct Array type directly. It turns out we do not do this yet for several types and those types do not have good (or any) test coverage. We almost missed this in https://github.com/apache/datafusion/pull/10711 in @xinlifoobar

Thus, we need to add support and tests for other types

Describe the solution you'd like

  1. Add a new test to arrow_stastics.rs (run this with cargo test --test parquet_exec) with the relevant type
  2. Potentially add a new case to the match here https://github.com/apache/datafusion/blob/acd7106fa40fad58f50ae06227971c51073d8f48/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L61-L182 to get the correct types

Here are some example PRs:

  1. https://github.com/apache/datafusion/pull/10729
  2. https://github.com/apache/datafusion/pull/10593

Describe alternatives you've considered

No response

Additional context

No response

Weijun-H commented 4 weeks ago

I am glad to work on this

alamb commented 4 weeks ago

Thanks @Weijun-H !