apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
547 stars 105 forks source link

feat: Add extended explain info to Comet plan #255

Closed parthchandra closed 1 month ago

parthchandra commented 1 month ago

Which issue does this PR close?

Closes #253

Rationale for this change

Adds addition planning information to a Spark plan which allows us to track the reasons why a Spark plan was not fully converted to Comet.

What changes are included in this PR?

This PR adds a CometExplainInfo structure that is returned by every step of the planning process. Eventually the Structure is attached to the Spark plan being returned by adding it as a Comet specific tag

How are these changes tested?

Additional unit test and tested with all TPCH and TPCDS queries with both Spark 3.4 and Spark 3.4 with the extended plan support. Also regenerated the CometTPC*QueriesList output for both TPCH and TPCDS run against a 1TB dataset. Note: Since the extended plan support does not exist before Spark 4, the output of CometTPC*QueriesList was updated with a build of Spark 3.4 with the feature backported.

parthchandra commented 1 month ago

Looking at the ci failures, will address the comments as well.

parthchandra commented 1 month ago

@andygrove I changed the core of the implementation. Instead of setting information in a CometExplainInfo structure and bubbling it up in the plan, I now set the explain information as a spark tag in the plan or expression. This makes the changes easier to implement as we add support for more operators and expressions. Please take a look.

parthchandra commented 1 month ago

@advancedxy, @andygrove Addressed your comments. I was missing a few paths where the explanation was not being propagated correctly. I've fixed the ones I could find. Also, updated the results in CometTPC*QueriesList. I had run this previously on a 1TB dataset and not run these after I rebased on main. I've run it again on a 1GB dataset but had to explicitly set some Bloom filter configuration options to simulate the conditions of the larger data set. Otherwise the settings are the same as the defaults we expect most users to have.

parthchandra commented 1 month ago

Looking into the ci failures ...

parthchandra commented 1 month ago

@andygrove could you please trigger a re-run of the failed ci workflow? (Only committers can do that). I cannot reproduce the failure locally and am pretty sure it has nothing to do with this PR.

viirya commented 1 month ago

Re-triggered.

parthchandra commented 1 month ago

Could we merge this please? @viirya @andygrove

viirya commented 1 month ago

There is conflict needed to be resolved. And I have a few minor comments like code comment changes.

advancedxy commented 1 month ago

could you please trigger a re-run of the failed ci workflow? (Only committers can do that). I cannot reproduce the failure locally and am pretty sure it has nothing to do with this PR.

BTW, you can close and re-open the PR to re-trigger the CI workflow. Or you can push an empty commit to re-trigger.

viirya commented 1 month ago

Thank you @parthchandra. I will merge this once CI passes.

parthchandra commented 1 month ago

Thanks for the review @viirya @andygrove @advancedxy

viirya commented 1 month ago

Merged. Thanks all.