Closed alamb closed 3 weeks ago
take
@alamb ...also while looking into this. I think Duration is not supported, thus we cannot extract statistics?
Thanks @marvinlanhenke -- I agree that since Duration can't be written to parquet we won't be able to extract statistics
Thank you for double checking
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/10453, where we are filling out support for extracting statistics for all data types from parquet files
At the moment, even if statistics are extracted for a different type (like Int32), the PruningPredicate will attempt to cast these values to the correct type: https://github.com/apache/datafusion/blob/acd7106fa40fad58f50ae06227971c51073d8f48/datafusion/core/src/physical_optimizer/pruning.rs#L909-L911
However, in order to be efficient and to ensure the cast kernel doesn't introduce anything incorrect, we should extract the parquet statistics as the correct Array type directly. It turns out we do not do this yet for several types, and those types do not have good (or any) test coverage. We almost missed this in https://github.com/apache/datafusion/pull/10711 from @xinlifoobar.
Thus, we need to add support and tests for other types.
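To make the idea of extracting statistics "as the correct Array type directly" concrete, here is a minimal std-only sketch. It is not the DataFusion or arrow-rs API: `LogicalType`, `StatsArray`, and `extract_stats` are hypothetical names invented for illustration, and mapping out-of-range values to `None` (unknown) is an assumption of this sketch, not confirmed behavior.

```rust
// Hypothetical, std-only model; NOT the DataFusion or arrow-rs API.

/// A few logical column types that Parquet stores physically as INT32.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Int8,
    Int16,
    Int32,
    Date32,
}

/// Stand-in for a typed Arrow array holding one statistic per row group.
#[derive(Debug, PartialEq)]
enum StatsArray {
    Int8(Vec<Option<i8>>),
    Int16(Vec<Option<i16>>),
    Int32(Vec<Option<i32>>),
    /// Days since the Unix epoch.
    Date32(Vec<Option<i32>>),
}

/// Build the statistics in the column's logical type directly, rather than
/// returning Int32 and relying on a later cast.
/// Assumption: a value that does not fit the target type becomes None.
fn extract_stats(raw: &[Option<i32>], target: LogicalType) -> StatsArray {
    match target {
        LogicalType::Int8 => StatsArray::Int8(
            raw.iter()
                .map(|&v| v.and_then(|x| i8::try_from(x).ok()))
                .collect(),
        ),
        LogicalType::Int16 => StatsArray::Int16(
            raw.iter()
                .map(|&v| v.and_then(|x| i16::try_from(x).ok()))
                .collect(),
        ),
        LogicalType::Int32 => StatsArray::Int32(raw.to_vec()),
        LogicalType::Date32 => StatsArray::Date32(raw.to_vec()),
    }
}

fn main() {
    // Per-row-group minimums as stored physically (INT32); None = no statistic.
    let raw_mins = [Some(-5), None, Some(19_000)];
    println!("{:?}", extract_stats(&raw_mins, LogicalType::Date32));
    println!("{:?}", extract_stats(&raw_mins, LogicalType::Int8));
}
```

A test for a given type would then assert on the variant as well as the values, so a result that silently comes back as `Int32` fails the comparison.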
Describe the solution you'd like
Add a test (run with cargo test --test parquet_exec) with the relevant type.
Here are some example PRs:
Describe alternatives you've considered
No response
Additional context
No response