apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

chore: Override node name for CometSparkToColumnar #958

Open JensonChoi opened 2 months ago

JensonChoi commented 2 months ago

Which issue does this PR close?

Closes #936

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Added unit test for row input. Still finding a Spark query that would force the use of CometSparkColumnarToColumnar node name.

JensonChoi commented 1 month ago

@andygrove can I get a review please? Thank you in advance.

andygrove commented 1 month ago

@andygrove can I get a review please? Thank you in advance.

Thanks for the PR @JensonChoi. The other change that will need to be made is to update the golden files for the tests that check for expected plans. You can find more information in the contributor guide: https://datafusion.apache.org/comet/contributor-guide/development.html#plan-stability-testing

JensonChoi commented 1 month ago

@andygrove Following our discussions on Discord, I ran the following commands to update the golden files.

make clean; make release PROFILES="-Pspark-3.4"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing? Thank you in advance.

andygrove commented 1 month ago

@andygrove Following our discussions on Discord, I ran the following commands to update the golden files.

make clean; make release PROFILES="-Pspark-3.4"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing? Thank you in advance.

It is possible that the golden files got written to a different location is you have SPARK_HOME set. Could you try unsettling that env var if you have it set? If you don't have that set, perhaps add logging to see where the files are being written?

JensonChoi commented 4 weeks ago

SPARK_HOME is currently set to /home/jenson/datafusion-comet, which is my local clone of the repo. Does that sound about right? I will try the logging route that you suggested to see where the files are being written to. Thanks.

JensonChoi commented 3 weeks ago

@andygrove I reran the following commands without the SPARK_GENERATE_GOLDEN_FILES=1 flag:

make clean; make release PROFILES="-Pspark-3.4"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

And they all passed. In addition, I briefly looked at the stability test suites here, and it appears none of them contains the node CometSparkToColumnar. What should be the next steps here?

andygrove commented 3 weeks ago

And they all passed. In addition, I briefly looked at the stability test suites here, and it appears none of them contains the node CometSparkToColumnar. What should be the next steps here?

Thanks for investigating this @JensonChoi. The golden files do not contain any RowToColumnar transitions, so this explains why they did not need updating.