apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Extract parquet statistics from `Duration` columns #10754

Closed alamb closed 3 weeks ago

alamb commented 4 weeks ago

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10453, where we are filling out support for extracting statistics for all data types from parquet files

At the moment, even if statistics are extracted for a different type (like Int32) the PruningPredicate will attempt to cast these values to the correct type:

https://github.com/apache/datafusion/blob/acd7106fa40fad58f50ae06227971c51073d8f48/datafusion/core/src/physical_optimizer/pruning.rs#L909-L911

However, in order to be efficient and ensure the cast kernel doesn't add anything incorrectly, we should be extracting the parquet statistics as the correct Array type directly. It turns out we do not do this yet for several types and those types do not have good (or any) test coverage. We almost missed this in https://github.com/apache/datafusion/pull/10711 in @xinlifoobar

Thus, we need to add support and tests for other types

Describe the solution you'd like

  1. Add a new test to arrow_stastics.rs (run this with cargo test --test parquet_exec) with the relevant type
  2. Potentially add a new case to the match here https://github.com/apache/datafusion/blob/acd7106fa40fad58f50ae06227971c51073d8f48/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L61-L182 to get the correct types

Here are some example PRs:

  1. https://github.com/apache/datafusion/pull/10729
  2. https://github.com/apache/datafusion/pull/10593

Describe alternatives you've considered

No response

Additional context

No response

marvinlanhenke commented 4 weeks ago

take

marvinlanhenke commented 4 weeks ago

@alamb ...also while looking into this. I think Duration is not supported, thus we cannot extract statistics?

alamb commented 3 weeks ago

Thanks @marvinlanhenke -- I agree that since Duration can't be written to parquet we won't be able to extract statistics

Thank you for double checking