apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

perf: Remove one redundant CopyExec for SMJ #962

Closed andygrove closed 1 month ago

andygrove commented 1 month ago

Which issue does this PR close?

N/A

Rationale for this change

We were injecting too many CopyExecs for SMJ. Plans looked like this:

SortMergeJoin
  CopyExec [UnpackOrDeepClone]
    SortExec
      CopyExec [UnpackOrDeepClone]
        ...
  CopyExec [UnpackOrDeepClone]
    SortExec
      CopyExec [UnpackOrDeepClone]
        ...

What changes are included in this PR?

With the changes in this PR, the plan now looks like:

SortMergeJoin
  SortExec
    CopyExec [UnpackOrDeepClone]
      ...
  SortExec
    CopyExec [UnpackOrDeepClone]
      ...

How are these changes tested?

Existing tests

andygrove commented 1 month ago

~I ran TPC-DS benchmarks comparing the latest from the main branch with this PR, and somehow, this PR introduces a considerable regression in performance. I am investigating.~

edit: I ran the wrong script :facepalm:

andygrove commented 1 month ago

A single run of TPC-DS comparing main to this PR does not show a significant difference (~1% difference). I did not expect this to make much difference, but it does remove a redundant operator and simplified the plan,