apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

perf: Add metric for time spent casting in native scan #919

Status: Closed (andygrove closed this 2 months ago)

andygrove commented 2 months ago

Which issue does this PR close?

N/A

Rationale for this change

Make it easy to see how much of the native ScanExec time is spent casting columns to different types (this usually means unpacking dictionaries).
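As a rough illustration of the idea (not the code in this PR), the sketch below times only the casting step and adds it to a separate total, so it can be reported next to the overall compute time. `PhaseTimer` and `unpack_dictionary` are hypothetical names that use only the Rust standard library. The real change presumably uses DataFusion's own metric types instead.

```rust
use std::time::{Duration, Instant};

/// Hypothetical accumulator for the time spent in one phase of an operator.
#[derive(Debug, Default)]
struct PhaseTimer {
    total: Duration,
}

impl PhaseTimer {
    /// Run `f`, add its wall-clock duration to the running total,
    /// and return whatever `f` returned.
    fn time<T>(&mut self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.total += start.elapsed();
        out
    }
}

/// Stand-in for unpacking a dictionary-encoded column:
/// replace each key with the value it points to.
fn unpack_dictionary(keys: &[usize], values: &[i32]) -> Vec<i32> {
    keys.iter().map(|&k| values[k]).collect()
}

fn main() {
    let values = vec![81, 90, 100];
    let keys = vec![0, 2, 1, 1, 0];

    // Only the cast/unpack step is wrapped, so `cast_time` is a subset
    // of whatever the operator records as total compute time.
    let mut cast_time = PhaseTimer::default();
    let column = cast_time.time(|| unpack_dictionary(&keys, &values));

    println!("rows={} cast_time={:?}", column.len(), cast_time.total);
}
```

Keeping the cast timer separate from the overall timer is what makes the two values directly comparable in the explain output.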

Example from TPC-DS q9:

[Image: Screenshot from 2024-09-06 11-57-48]

DataFusion metrics in native explain output:

metrics=[
  output_rows=2097152, 
  elapsed_compute=21.847892ms, 
  cast_time=21.631731ms]
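To put these numbers in perspective, casting accounts for roughly 99% of the scan's `elapsed_compute` here (21.631731 ms out of 21.847892 ms). The short sketch below reproduces that arithmetic from the metric strings. `parse_millis` and `cast_share` are hypothetical helpers written for this illustration and are not part of Comet or DataFusion.

```rust
/// Parse a duration string such as "21.847892ms" or "50.596µs" into milliseconds.
/// Returns None when the unit or the number is not recognized.
fn parse_millis(s: &str) -> Option<f64> {
    // Longer suffixes come first because "s" would also match "ms".
    let units = [("ns", 1e-6), ("µs", 1e-3), ("ms", 1.0), ("s", 1e3)];
    for &(suffix, factor) in units.iter() {
        if let Some(num) = s.trim().strip_suffix(suffix) {
            return num.parse::<f64>().ok().map(|v| v * factor);
        }
    }
    None
}

/// Fraction of `elapsed_compute` that was spent casting.
fn cast_share(elapsed_compute: &str, cast_time: &str) -> Option<f64> {
    Some(parse_millis(cast_time)? / parse_millis(elapsed_compute)?)
}

fn main() {
    // Values copied from the ScanExec metrics shown above.
    match cast_share("21.847892ms", "21.631731ms") {
        Some(share) => println!("cast share of scan compute: {:.1}%", share * 100.0),
        None => println!("could not parse metrics"),
    }
}
```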

Full plan:

AggregateExec: mode=Partial, gby=[], aggr=[count, avg, avg], metrics=[output_rows=1, elapsed_compute=9.481194ms]
  ProjectionExec: expr=[col_1@1 as col_0, col_2@2 as col_1], metrics=[output_rows=400519, elapsed_compute=50.596µs]
    FilterExec: col_0@0 IS NOT NULL AND col_0@0 >= 81 AND col_0@0 <= 100, metrics=[output_rows=400519, elapsed_compute=4.753725ms]
      ScanExec: source=[CometScan parquet  (unknown)], schema=[col_0: Int32, col_1: Decimal128(7, 2), col_2: Decimal128(7, 2)], metrics=[output_rows=2097152, elapsed_compute=21.847892ms, cast_time=21.631731ms]

What changes are included in this PR?

How are these changes tested?

andygrove commented 2 months ago

@viirya @comphead I have addressed the feedback. Thanks.