apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

chore: Improve CometScan metrics #1100

Closed. andygrove closed this pull request 4 days ago.

andygrove commented 5 days ago

Which issue does this PR close?

Part of https://github.com/apache/datafusion-comet/issues/1098

Rationale for this change

What changes are included in this PR?

Various improvements:

Example of "native explain" logging

24/11/19 12:47:47 INFO core/src/execution/jni_api.rs: Comet native query plan with metrics (plan creation time: 1275ms):
SortExec: TopK(fetch=10), expr=[col_1@1 DESC NULLS LAST, col_2@2 ASC], preserve_partitioning=[false], metrics=[output_rows=10, elapsed_compute=468.703µs, row_replacements=90]
  CopyExec [UnpackOrDeepCopy], metrics=[output_rows=39508, elapsed_compute=72.877µs]
    ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32], metrics=[output_rows=39508, elapsed_compute=1.353µs, cast_time=1ns]

How are these changes tested?

Non-functional change. Existing tests.

andygrove commented 4 days ago

I am moving this to draft for now. The plan creation time now matches the ScanExec metric since https://github.com/apache/datafusion-comet/pull/1105 was merged.

I would still like to see the planning time, but only if it excludes execution time for the first batch. I have some ideas I want to try out to resolve this.

24/11/19 19:04:23 INFO core/src/execution/jni_api.rs: Comet native query plan with metrics (plan creation took 1339 ms):
ShuffleWriterExec: partitioning=UnknownPartitioning(1), metrics=[output_rows=10, elapsed_compute=32.602µs, spill_count=0, spilled_bytes=0, data_size=704]
  ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32], metrics=[output_rows=20, elapsed_compute=1.339205712s, cast_time=1ns]
andygrove commented 4 days ago

@viirya @kazuyukitanimura @comphead Could you take another look? The scope increased slightly (see updated PR description).

This is what the logging looks like now for native query plans with metrics:

[Stage 18 Partition 25] plan creation (including CometScans fetching first batches) took 1.490261253s:
ShuffleWriterExec: partitioning=UnknownPartitioning(1), metrics=[output_rows=10, elapsed_compute=55.574µs, spill_count=0, spilled_bytes=0, data_size=704]
  ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32], metrics=[output_rows=20, elapsed_compute=1.490176093s, cast_time=1ns]
comphead commented 4 days ago

> @viirya @kazuyukitanimura @comphead Could you take another look? The scope increased slightly (see updated PR description).
>
> This is what the logging looks like now for native query plans with metrics:
>
> [Stage 18 Partition 25] plan creation (including CometScans fetching first batches) took 1.490261253s:
> ShuffleWriterExec: partitioning=UnknownPartitioning(1), metrics=[output_rows=10, elapsed_compute=55.574µs, spill_count=0, spilled_bytes=0, data_size=704]
>   ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32], metrics=[output_rows=20, elapsed_compute=1.490176093s, cast_time=1ns]

LGTM. Do you need to `info!` the planning times explicitly?

andygrove commented 4 days ago

> LGTM. Do you need to `info!` the planning times explicitly?

I'm going to keep iterating on this. I would like to see the actual planning time without the partial execution of fetching the first batch.