`ParquetExec::statistics()` does not read statistics for many column types (like timstamps, strings, etc)

alamb commented 11 months ago

Describe the bug

While working on https://github.com/apache/arrow-datafusion/issues/8229 I found another bug that is non obvious, but that can be clearly seen now thanks to https://github.com/apache/arrow-datafusion/issues/8110 and https://github.com/apache/arrow-datafusion/issues/8111 from @NGA-TRAN

To Reproduce

❯ copy (values ('foo'), ('bar'), ('baz')) to '/tmp/strings.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.

And then look at the explain verbose up can see there are no min/max statisics shown:

❯ explain verbose select * from '/tmp/strings.parquet';

|                                                            |                                                                                                                                                                |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/strings.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Null=Exact(0))]] |
|                                                            |                                                                                                                                                                |
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
80 rows in set. Query took 0.002 seconds.

Expected behavior

I expect there to be min/max values extracted in the statistics for the strings, as there are for integers ((Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)))

❯ copy (values (1), (2), (3)) to '/tmp/ints.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.

❯ explain verbose select * from '/tmp/ints.parquet';
...
                                                                                                               |
| physical_plan                                              | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1]                                                                                                              |
|                                                            |                                                                                                                                                                                                     |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)) Null=Exact(0))]] |
|                                                            |                                                                                                                                                                                                     |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Additional context

No response

alamb commented 11 months ago

Note that the pruning predicate code does correctly read the statistics for other strings and timestamps, because it uses a different code path

alamb commented 11 months ago

I plan to fix this

Weijun-H commented 9 months ago

Could I pick this ticket up?

Weijun-H commented 9 months ago

In fn summarize_min_max, it cannot handle ByteArray(ValueStatistics<ByteArray>) well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?

alamb commented 9 months ago

In fn summarize_min_max, it cannot handle ByteArray(ValueStatistics<ByteArray>) well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?

I think there is some subtly related to decimals as well -- the best thing to do is probably to study what the existing code in row_groups does -- I think it is here https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57

alamb commented 9 months ago

At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined

apache / datafusion