apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Parquet Statistics Pruning Ignores ColumnOrder, resulting in potentially incorrect statistics #8342

Open tustvold opened 12 months ago

tustvold commented 12 months ago

Describe the bug

Parquet min/max statistics are only valid when interpreted in the context of ColumnOrder; otherwise the results are not necessarily correct.

The ColumnOrder field in the Parquet file metadata records which ordering was used to compute the min/max values. It does not seem to be widely used or populated in the ecosystem. However, ignoring it when it is present is probably wrong.
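
To illustrate why the ordering matters, here is a minimal, std-only sketch. It does not use the parquet crate or DataFusion's pruning code; the `SortOrder` enum and the `compare` and `may_contain` helpers are hypothetical names made up for this example. It shows how min/max values written under an unsigned byte ordering can lead to wrong pruning if the reader assumes a signed ordering, and how an undefined ordering should disable pruning altogether.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for the ordering a writer used to compute min/max.
/// This is NOT the parquet crate's type, only an illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SortOrder {
    Signed,
    Unsigned,
    Undefined,
}

/// Compare two byte strings under the given ordering.
/// Returns None when the ordering is undefined, i.e. no comparison is meaningful.
fn compare(a: &[u8], b: &[u8], order: SortOrder) -> Option<Ordering> {
    match order {
        // Lexicographic comparison of bytes as unsigned values.
        SortOrder::Unsigned => Some(a.cmp(b)),
        // Lexicographic comparison of bytes reinterpreted as signed values.
        SortOrder::Signed => {
            let a: Vec<i8> = a.iter().map(|&x| x as i8).collect();
            let b: Vec<i8> = b.iter().map(|&x| x as i8).collect();
            Some(a.cmp(&b))
        }
        SortOrder::Undefined => None,
    }
}

/// Returns false only when the statistics prove `value` cannot be in [min, max].
/// With an undefined ordering the statistics prove nothing, so we must keep the data.
fn may_contain(min: &[u8], max: &[u8], value: &[u8], order: SortOrder) -> bool {
    match (compare(min, value, order), compare(value, max, order)) {
        (Some(lo), Some(hi)) => lo != Ordering::Greater && hi != Ordering::Greater,
        _ => true,
    }
}
```

For example, with `min = [0x01]` and `max = [0xFF]` written under an unsigned ordering, the value `[0x80]` lies inside the range, so `may_contain` returns `true` for `SortOrder::Unsigned`. A reader that assumes `SortOrder::Signed` sees `0x80` as -128, concludes it is below the minimum, and would wrongly skip data that may contain the value.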

To Reproduce

No response

Expected behavior

No response

Additional context

No response

alamb commented 12 months ago

https://docs.rs/parquet/latest/parquet/basic/enum.ColumnOrder.html is the relevant code

alamb commented 11 months ago

I believe we fixed this in https://github.com/apache/arrow-datafusion/pull/8294

But I am not 100% sure, given the dearth of information on this ticket. Please reopen it if I am misunderstanding.

tustvold commented 11 months ago

Afraid this is tracking something different, which that PR didn't address, as we aren't even populating this correctly in parquet-rs currently: https://github.com/apache/arrow-rs/issues/5152

alamb commented 11 months ago

Thanks, I updated the description to hopefully provide a little more background.

edmondop commented 11 months ago

I'll pick this one up if I can.