apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

`ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) #8295

Open alamb opened 10 months ago

alamb commented 10 months ago

Describe the bug

While working on https://github.com/apache/arrow-datafusion/issues/8229 I found another bug that is non-obvious, but that can now be seen clearly thanks to https://github.com/apache/arrow-datafusion/issues/8110 and https://github.com/apache/arrow-datafusion/issues/8111 from @NGA-TRAN.

To Reproduce

❯ copy (values ('foo'), ('bar'), ('baz')) to '/tmp/strings.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.

Then look at the `explain verbose` output: you can see that no min/max statistics are shown:

❯ explain verbose select * from '/tmp/strings.parquet';

|                                                            |                                                                                                                                                                |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/strings.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Null=Exact(0))]] |
|                                                            |                                                                                                                                                                |
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
80 rows in set. Query took 0.002 seconds.

Expected behavior

I expect min/max values to be extracted in the statistics for strings, as they are for integers (`Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3))`):

❯ copy (values (1), (2), (3)) to '/tmp/ints.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.
❯ explain verbose select * from '/tmp/ints.parquet';
...
                                                                                                               |
| physical_plan                                              | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1]                                                                                                              |
|                                                            |                                                                                                                                                                                                     |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)) Null=Exact(0))]] |
|                                                            |                                                                                                                                                                                                     |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Additional context

No response

alamb commented 10 months ago

Note that the pruning predicate code does correctly read the statistics for other types such as strings and timestamps, because it uses a different code path.
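
To illustrate why min/max statistics are useful for strings as well as integers, here is a minimal sketch of range-based pruning. `StringStats` and `may_contain` are hypothetical names invented for this illustration; they are not DataFusion APIs, and the actual pruning predicate code is far more general.

```rust
/// Hypothetical per-row-group statistics for a string column.
/// `None` models statistics that are absent from the file.
#[derive(Debug, Clone)]
pub struct StringStats {
    pub min: Option<String>,
    pub max: Option<String>,
}

/// Returns `false` only when the statistics prove that no row can equal
/// `needle`, so the row group can safely be skipped.
pub fn may_contain(stats: &StringStats, needle: &str) -> bool {
    match (&stats.min, &stats.max) {
        // Rust compares `str` values lexicographically by bytes.
        (Some(min), Some(max)) => min.as_str() <= needle && needle <= max.as_str(),
        // Without both bounds nothing can be ruled out.
        _ => true,
    }
}
```

For the `strings.parquet` file above, the min would be `bar` and the max `foo`, so a lookup for `zzz` could skip the file entirely, while a lookup for `baz` could not.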

alamb commented 10 months ago

I plan to fix this

Weijun-H commented 8 months ago

Could I pick this ticket up?

Weijun-H commented 8 months ago

`fn summarize_min_max` does not handle `ByteArray(ValueStatistics<ByteArray>)` well. Do we need to convert it to a different type, such as timestamps, strings, etc. 🤔?

alamb commented 8 months ago

> `fn summarize_min_max` does not handle `ByteArray(ValueStatistics<ByteArray>)` well. Do we need to convert it to a different type, such as timestamps, strings, etc. 🤔?

I think there is some subtlety related to decimals as well -- the best thing to do is probably to study what the existing code in row_groups does. I think it is here: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57
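
As a rough sketch of the kind of conversion being discussed, the raw bytes of a byte-array statistic have to be interpreted according to the column's logical type. The types `LogicalType` and `TypedValue` and the function `decode_stat` below are hypothetical stand-ins, not the `parquet` crate or DataFusion APIs. The decimal branch assumes the Parquet convention of a big-endian two's-complement unscaled integer.

```rust
/// Hypothetical, simplified logical types for a column stored as raw bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LogicalType {
    Utf8,
    /// Unscaled big-endian two's-complement integer with `scale` fractional digits.
    Decimal { scale: u32 },
}

/// Hypothetical typed value decoded from a raw min or max statistic.
#[derive(Debug, Clone, PartialEq)]
pub enum TypedValue {
    Utf8(String),
    Decimal { unscaled: i128, scale: u32 },
}

/// Interprets raw statistic bytes according to the logical type.
/// Returns `None` when the bytes cannot be decoded, which a caller
/// would treat as "statistic absent".
pub fn decode_stat(bytes: &[u8], logical: LogicalType) -> Option<TypedValue> {
    match logical {
        LogicalType::Utf8 => std::str::from_utf8(bytes)
            .ok()
            .map(|s| TypedValue::Utf8(s.to_string())),
        LogicalType::Decimal { scale } => {
            if bytes.is_empty() || bytes.len() > 16 {
                return None;
            }
            // Sign-extend to 16 bytes before converting to i128.
            let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
            let mut buf = [fill; 16];
            buf[16 - bytes.len()..].copy_from_slice(bytes);
            Some(TypedValue::Decimal {
                unscaled: i128::from_be_bytes(buf),
                scale,
            })
        }
    }
}
```

The same bytes decode to very different values depending on the logical type, which is why a single untyped `ByteArray` branch is not enough.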

alamb commented 8 months ago

At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined
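
One way to picture how the two paths relate: file-level min/max can be derived by folding the row-group-level values. The sketch below is only an illustration of that idea. `MinMax` and `merge_row_groups` are hypothetical names, not existing DataFusion code, and it assumes that a single row group without statistics makes the file-level range unknown.

```rust
/// Hypothetical min/max pair for one column in one row group.
#[derive(Debug, Clone, PartialEq)]
pub struct MinMax<T> {
    pub min: T,
    pub max: T,
}

/// Folds per-row-group statistics into a file-level range.
/// Returns `None` if there are no row groups or if any row group
/// lacks statistics, since the overall range is then unknown.
pub fn merge_row_groups<T: Ord + Clone>(groups: &[Option<MinMax<T>>]) -> Option<MinMax<T>> {
    let mut iter = groups.iter();
    let mut acc = iter.next()?.clone()?;
    for group in iter {
        let g = group.as_ref()?;
        if g.min < acc.min {
            acc.min = g.min.clone();
        }
        if g.max > acc.max {
            acc.max = g.max.clone();
        }
    }
    Some(acc)
}
```

Because the function is generic over any `Ord` type, the same fold would cover integers, strings, and timestamps once the values have been decoded into typed form.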