Closed ulysses-you closed 2 months ago
cc @JingsongLi @Zouxxyy thank you
@Zouxxyy We may need to run a TPC-DS test to show the compact.
@ulysses-you Thanks for the contribution, have you tested the effect in production scenarios? We tested the old codes and it had the same effect as parquet's analyze.
thank you @Zouxxyy. I tested with 20 columns and using explain cost
to see the sinzeInBytes after doing analyze, it does affect:
without this pr:
explain cost select c1, c2, c3 from w_t;
== Optimized Logical Plan ==
RelationV2[c1#163, c2#164, c3#165] default.w_t, Statistics(sizeInBytes=36.8 MiB, rowCount=1.00E+5)
explain cost select * from w_t;
== Optimized Logical Plan ==
RelationV2[c1#194, c2#195, c3#196, c4#197, c5#198, c6#199, c7#200, c8#201, c9#202, c10#203, c11#204, c12#205, c13#206, c14#207, c15#208, c16#209, c17#210, c18#211, c19#212, c20#213] default.w_t, Statistics(sizeInBytes=36.8 MiB, rowCount=1.00E+5)
with this pr:
explain cost select c1, c2, c3 from w_t;
== Optimized Logical Plan ==
RelationV2[c1#5, c2#6, c3#7] default.w_t, Statistics(sizeInBytes=5.5 MiB, rowCount=1.00E+5)
explain cost select * from w_t;
== Optimized Logical Plan ==
RelationV2[c1#36, c2#37, c3#38, c4#39, c5#40, c6#41, c7#42, c8#43, c9#44, c10#45, c11#46, c12#47, c13#48, c14#49, c15#50, c16#51, c17#52, c18#53, c19#54, c20#55] default.w_t, Statistics(sizeInBytes=36.8 MiB, rowCount=1.00E+5)
In general, the sinzeInBytes
would affect broadcast join, runtime filter, etc. So it can happen that, nothing changes with pr if the query does not hit the related case.
+1
Purpose
SupportsReportStatistics#estimateStatistics
should report the statistics after do column pruning, filter push down. This pr fixes it does not do column pruning. In addition, we should consider the metadata column size.Introduce
defaultSize
method forDataType
, so that we can get the size in bytes without column statistics.Tests
add test
API and Format
no
Documentation