apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

perf: Report accurate total time for scans #916

Closed andygrove closed 2 months ago

andygrove commented 2 months ago

Which issue does this PR close?

Closes https://github.com/apache/datafusion-comet/issues/914

Rationale for this change

The total scan time reported by Comet is often less than the reported time to decode Parquet data, which cannot be right because decoding is part of the scan.

The issue is that we convert the elapsed time from nanoseconds to milliseconds for each batch before adding it to the total. Each conversion discards the sub-millisecond remainder, and that loss accumulates over many batches. In one example, the actual total scan time was 41 seconds but it was reported as 23 seconds, which is very misleading. Spark also suffers from this problem.
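To illustrate the effect, here is a minimal sketch in Python (not Comet's actual code). The function names and the batch durations are hypothetical, and it assumes the conversion truncates toward zero as integer division does. It compares converting each batch to milliseconds before summing with summing nanoseconds and converting once.

```python
NANOS_PER_MILLI = 1_000_000


def total_ms_converting_per_batch(batch_nanos):
    """Convert each batch's elapsed time to whole milliseconds, then sum.

    The sub-millisecond remainder of every batch is discarded.
    """
    return sum(n // NANOS_PER_MILLI for n in batch_nanos)


def total_ms_accumulating_nanos(batch_nanos):
    """Sum elapsed nanoseconds first and convert only once at the end."""
    return sum(batch_nanos) // NANOS_PER_MILLI


# Hypothetical workload: 10,000 batches that each take 1.9 ms.
batch_nanos = [1_900_000] * 10_000

per_batch_total = total_ms_converting_per_batch(batch_nanos)
accumulated_total = total_ms_accumulating_nanos(batch_nanos)

print(per_batch_total)    # each batch contributes only 1 ms
print(accumulated_total)  # the full 19 seconds is preserved
```

With these made-up numbers the per-batch conversion reports roughly half of the real elapsed time, which is the same kind of gap as the 23 seconds versus 41 seconds described above.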

Before

Screenshot from 2024-09-05 09-56-51

After

Screenshot from 2024-09-05 09-56-09

What changes are included in this PR?

Record scan time in nanoseconds instead of converting it to milliseconds for each batch.

How are these changes tested?

Manual testing. See the Before and After screenshots above.