Open alamb opened 4 months ago
Some ideas about solving this:

A substantial amount of time is spent managing the `ScalarValue`s in the statistics. I plan to try it today.
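For context (my reading, not something stated in the issue): the per-column statistics store min/max as `ScalarValue`s wrapped in `Precision`, so every copy of a table's `Statistics` copies all of those values. A simplified stand-in for the shapes involved (not the exact `datafusion_common` definitions):

```rust
/// Simplified stand-in for datafusion_common's ScalarValue: an enum over many
/// Arrow types, some of which own heap data (strings, binaries, lists, ...).
pub enum ScalarValue {
    Int64(Option<i64>),
    Utf8(Option<String>),
    // ... many more variants in the real type
}

/// A statistics value that may be exact, inexact, or missing.
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Per-column statistics. Cloning one may allocate (e.g. for Utf8 min/max
/// values), and a listing with many files and a wide schema clones a lot of
/// these.
pub struct ColumnStatistics {
    pub null_count: Precision<usize>,
    pub max_value: Precision<ScalarValue>,
    pub min_value: Precision<ScalarValue>,
    pub distinct_count: Precision<usize>,
}
```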
One simple thing to try is refactoring `Statistics` to:

```rust
use std::sync::Arc;

pub struct StatisticsInner {
    /// The number of table rows.
    pub num_rows: Precision<usize>,
    /// Total bytes of the table rows.
    pub total_byte_size: Precision<usize>,
    /// Statistics on a column level. It contains a [`ColumnStatistics`] for
    /// each field in the schema of the table to which the [`Statistics`] refer.
    pub column_statistics: Vec<ColumnStatistics>,
}

pub struct Statistics {
    inner: Arc<StatisticsInner>,
}
```

Then cloning a `Statistics` is just a cheap `Arc` clone.
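If call sites should keep reading fields directly, one (hypothetical) way to smooth the migration is a `Deref` impl on top of the structs above; a minimal sketch:

```rust
use std::ops::Deref;

// Reads stay the same at call sites (`stats.num_rows`, etc.); only code that
// mutates statistics in place would need `Arc::make_mut` (copy-on-write) or
// to build a fresh `StatisticsInner`.
impl Deref for Statistics {
    type Target = StatisticsInner;

    fn deref(&self) -> &StatisticsInner {
        &self.inner
    }
}
```

With that layout, `#[derive(Clone)]` on `Statistics` only bumps the `Arc` reference count instead of copying every `ScalarValue`.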
take
That would be a very interesting experiment to try
Based on the detailed profile of ClickBench Q0 below, maybe the optimization work can be divided into three parts:

- `Statistics`
- `get_statistics_with_limit`
- temporary vectors (it seems some may exist, but I'm not sure)

Trying the first possible optimization now.
https://github.com/apache/datafusion/pull/11802 is very nice 👌 It would be fascinating to know what the flamegraph looks like after that PR (i.e. what the next highest bottlenecks are).
😄 I guess they will be plan creation and object store listing.
Is your feature request related to a problem or challenge?
I spent some time looking at the ClickBench results with DataFusion 40.0.0 https://github.com/apache/datafusion/issues/11567#issuecomment-2254520675 (thanks @pmcgleenon 🙏 )
Specifically, I looked into how we could make some of the already fast queries on the partitioned dataset faster. Unsurprisingly, for the really fast queries the query time is actually dominated by parquet metadata analysis and DataFusion statistics creation.
For example, ClickBench Q0. To reproduce, run:
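The exact command isn't included in this excerpt; a roughly equivalent reproduction using the DataFusion Rust API (the local path `hits_partitioned/` is a placeholder for wherever the partitioned files live) would be something like:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the partitioned ClickBench parquet files as a single table.
    ctx.register_parquet("hits", "hits_partitioned/", ParquetReadOptions::default())
        .await?;

    // ClickBench Q0
    let df = ctx.sql("SELECT COUNT(*) FROM hits").await?;
    df.show().await?;
    Ok(())
}
```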
I profiled this using Instruments. Here are some annotated screenshots:

Some of my takeaways are:
Describe the solution you'd like
It would be cool to make these queries faster by reducing the per-file metadata handling overhead (e.g. don't read the metadata more than once, and figure out some way to make statistics handling more efficient).
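For the "don't read the metadata more than once" part, one possible shape is a small cache of decoded parquet footers keyed by file path. This is just a sketch of the general idea, not a proposal for where it should live in DataFusion, and `MetadataCache`/`get_or_load` are made-up names:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-process cache of decoded parquet footers, keyed by file
/// path. A real version would also key on file size / last-modified time so
/// that rewritten files are not served stale metadata.
struct MetadataCache<M> {
    entries: Mutex<HashMap<String, Arc<M>>>,
}

impl<M> MetadataCache<M> {
    fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Return the cached metadata for `path`, calling `load` to decode it only
    /// on the first request for that path.
    fn get_or_load<E>(
        &self,
        path: &str,
        load: impl FnOnce() -> Result<M, E>,
    ) -> Result<Arc<M>, E> {
        if let Some(m) = self.entries.lock().unwrap().get(path) {
            return Ok(Arc::clone(m));
        }
        // Note: two concurrent misses may both decode the footer; a production
        // version would deduplicate in-flight loads.
        let decoded = Arc::new(load()?);
        self.entries
            .lock()
            .unwrap()
            .insert(path.to_string(), Arc::clone(&decoded));
        Ok(decoded)
    }
}
```

Where such a cache would hang (session state, runtime env, or the parquet file format itself) is the real design question.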
Describe alternatives you've considered
Note: this project isn't broken down into tasks yet.
I think @Ted-Jiang did some work a while back to cache parquet metadata.
Additional context
No response