apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

feat: Improve CometBroadcastHashJoin statistics #339

Closed planga82 closed 2 weeks ago

planga82 commented 2 weeks ago

Which issue does this PR close?

Closes #338.

Rationale for this change

Expose all of the metrics that the DataFusion HashJoinExec node provides in CometBroadcastHashJoin.

What changes are included in this PR?

All metrics available from HashJoinExec are now reported:

/// Total time for collecting build-side of join
pub(crate) build_time: metrics::Time,
/// Number of batches consumed by build-side
pub(crate) build_input_batches: metrics::Count,
/// Number of rows consumed by build-side
pub(crate) build_input_rows: metrics::Count,
/// Memory used by build-side in bytes
pub(crate) build_mem_used: metrics::Gauge,
/// Total time for joining probe-side batches to the build-side batches
pub(crate) join_time: metrics::Time,
/// Number of batches consumed by probe-side of this operator
pub(crate) input_batches: metrics::Count,
/// Number of rows consumed by probe-side of this operator
pub(crate) input_rows: metrics::Count,
/// Number of batches produced by this operator
pub(crate) output_batches: metrics::Count,
/// Number of rows produced by this operator
pub(crate) output_rows: metrics::Count
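
As a rough illustration of how counters like these are usually updated during the build and probe phases of a hash join, here is a minimal sketch that uses only the Rust standard library. It does not use DataFusion's `metrics::Time`, `metrics::Count`, or `metrics::Gauge` types; `JoinMetricsSketch`, `record_build_batch`, `record_probe_batch`, `snapshot`, and `demo_snapshot` are hypothetical names invented for this example, and the numbers are made up.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

/// Hypothetical stand-in for the metric fields listed above.
/// Times are stored as nanoseconds; all values are plain atomic counters.
#[derive(Default)]
pub struct JoinMetricsSketch {
    build_time_nanos: AtomicUsize,
    build_input_batches: AtomicUsize,
    build_input_rows: AtomicUsize,
    build_mem_used: AtomicUsize,
    join_time_nanos: AtomicUsize,
    input_batches: AtomicUsize,
    input_rows: AtomicUsize,
    output_batches: AtomicUsize,
    output_rows: AtomicUsize,
}

impl JoinMetricsSketch {
    /// Record one batch consumed while collecting the build side.
    pub fn record_build_batch(&self, rows: usize, bytes: usize, elapsed: Duration) {
        self.build_input_batches.fetch_add(1, Ordering::Relaxed);
        self.build_input_rows.fetch_add(rows, Ordering::Relaxed);
        self.build_mem_used.fetch_add(bytes, Ordering::Relaxed);
        self.build_time_nanos
            .fetch_add(elapsed.as_nanos() as usize, Ordering::Relaxed);
    }

    /// Record one probe-side batch joined against the build side.
    pub fn record_probe_batch(&self, rows_in: usize, rows_out: usize, elapsed: Duration) {
        self.input_batches.fetch_add(1, Ordering::Relaxed);
        self.input_rows.fetch_add(rows_in, Ordering::Relaxed);
        self.output_batches.fetch_add(1, Ordering::Relaxed);
        self.output_rows.fetch_add(rows_out, Ordering::Relaxed);
        self.join_time_nanos
            .fetch_add(elapsed.as_nanos() as usize, Ordering::Relaxed);
    }

    /// Flatten the counters into a name -> value map, the kind of shape a
    /// caller could forward to another metrics system.
    pub fn snapshot(&self) -> BTreeMap<&'static str, usize> {
        let load = |c: &AtomicUsize| c.load(Ordering::Relaxed);
        BTreeMap::from([
            ("build_time", load(&self.build_time_nanos)),
            ("build_input_batches", load(&self.build_input_batches)),
            ("build_input_rows", load(&self.build_input_rows)),
            ("build_mem_used", load(&self.build_mem_used)),
            ("join_time", load(&self.join_time_nanos)),
            ("input_batches", load(&self.input_batches)),
            ("input_rows", load(&self.input_rows)),
            ("output_batches", load(&self.output_batches)),
            ("output_rows", load(&self.output_rows)),
        ])
    }
}

/// Simulate two build-side batches and one probe-side batch (made-up values).
pub fn demo_snapshot() -> BTreeMap<&'static str, usize> {
    let m = JoinMetricsSketch::default();
    m.record_build_batch(100, 4096, Duration::from_millis(2));
    m.record_build_batch(50, 2048, Duration::from_millis(1));
    m.record_probe_batch(1000, 120, Duration::from_millis(5));
    m.snapshot()
}
```

Calling `demo_snapshot()` from a `main` function or a test returns the map of metric names to values. The split between build-side and probe-side counters mirrors the field list above: `build_*` values describe the collected build side, while `input_*`, `output_*`, and `join_time` describe the probe phase.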

How are these changes tested?

Unit tests and manual testing.

planga82 commented 2 weeks ago

The tests appear to fail on Spark 3.2 and Spark 3.3. I'm looking into it.

planga82 commented 2 weeks ago

The fix has been tested in my repository with GitHub Actions.

viirya commented 2 weeks ago

Merged. Thanks @planga82 @kazuyukitanimura