apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.48k stars 970 forks source link

[spark] Fix reported statistics does not do column pruning #4137

Closed ulysses-you closed 2 months ago

ulysses-you commented 2 months ago

Purpose

SupportsReportStatistics#estimateStatistics should report the statistics after do column pruning, filter push down. This pr fixes it does not do column pruning. In addition, we should consider the metadata column size.

Introduce defaultSize method for DataType, so that we can get the size in bytes without column statistics.

Tests

add test

API and Format

no

Documentation

ulysses-you commented 2 months ago

cc @JingsongLi @Zouxxyy thank you

JingsongLi commented 2 months ago

@Zouxxyy We may need to run a TPC-DS test to show the compact.

Zouxxyy commented 2 months ago

@ulysses-you Thanks for the contribution, have you tested the effect in production scenarios? We tested the old codes and it had the same effect as parquet's analyze.

ulysses-you commented 2 months ago

thank you @Zouxxyy. I tested with 20 columns and using explain cost to see the sinzeInBytes after doing analyze, it does affect:

without this pr:

explain cost select c1, c2, c3 from w_t;

== Optimized Logical Plan ==
RelationV2[c1#163, c2#164, c3#165] default.w_t, Statistics(sizeInBytes=36.8 MiB, rowCount=1.00E+5)

explain cost select * from w_t;

== Optimized Logical Plan ==
RelationV2[c1#194, c2#195, c3#196, c4#197, c5#198, c6#199, c7#200, c8#201, c9#202, c10#203, c11#204, c12#205, c13#206, c14#207, c15#208, c16#209, c17#210, c18#211, c19#212, c20#213] default.w_t, Statistics(sizeInBytes=36.8 MiB, rowCount=1.00E+5)

with this pr:

explain cost select c1, c2, c3 from w_t;

== Optimized Logical Plan ==
RelationV2[c1#5, c2#6, c3#7] default.w_t, Statistics(sizeInBytes=5.5 MiB, rowCount=1.00E+5)

explain cost select * from w_t;

== Optimized Logical Plan ==
RelationV2[c1#36, c2#37, c3#38, c4#39, c5#40, c6#41, c7#42, c8#43, c9#44, c10#45, c11#46, c12#47, c13#48, c14#49, c15#50, c16#51, c17#52, c18#53, c19#54, c20#55] default.w_t, Statistics(sizeInBytes=36.8 MiB, rowCount=1.00E+5)

In general, the sinzeInBytes would affect broadcast join, runtime filter, etc. So it can happen that, nothing changes with pr if the query does not hit the related case.

JingsongLi commented 2 months ago

+1