Open jlowe opened 6 months ago
From CI, this test failure occurred in Spark 3.5.0
I was able to replicate both failures on Spark 3.2.4, 3.3.3, 3.4.0, and 3.5.0 (all versions of Spark that support AQE + DPP)
Based on the plan output here, it looks like an AQE optimization is turning the entire plan into a LocalTableScan:
E py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertContains.
E : java.lang.AssertionError: assertion failed: Could not find DynamicPruningExpression in the Spark plan
E AdaptiveSparkPlan isFinalPlan=true
E +- == Final Plan ==
E LocalTableScan <empty>, [key#2006, max(value)#2017L]
E +- == Initial Plan ==
E Sort [key#2006 ASC NULLS FIRST, max(value)#2017L ASC NULLS FIRST], true, 0
E +- Exchange rangepartitioning(key#2006 ASC NULLS FIRST, max(value)#2017L ASC NULLS FIRST, 4), ENSURE_REQUIREMENTS, [plan_id=5302]
E +- HashAggregate(keys=[key#2006], functions=[max(value#2007L)], output=[key#2006, max(value)#2017L])
E +- Exchange hashpartitioning(key#2006, 4), ENSURE_REQUIREMENTS, [plan_id=5299]
E +- HashAggregate(keys=[key#2006], functions=[partial_max(value#2007L)], output=[key#2006, max#2023L])
E +- Union
...
I guess AQE determined via the join that this would return an empty result.
After some debugging, I think one of the subqueries returned an empty result, so AQE short-circuited the plan into a LocalTableScan <empty>. This happens on both the CPU and GPU, but it means the resulting plan does not contain a DynamicPruningExpression. So it looks like the solution here is to update the test logic to be an either/or capture: either the plan is a single LocalTableScanExec, or the GPU plan must contain a DynamicPruningExpression.
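A minimal sketch of such an either/or check, assuming the captured plan is available as a string. The helper name and the substring matching are hypothetical illustrations, not the plugin's actual ExecutionPlanCaptureCallback API:

```python
def plan_is_dpp_or_degenerate(plan: str) -> bool:
    """Accept a plan that either contains a DynamicPruningExpression,
    or was collapsed by AQE into an empty LocalTableScan."""
    return ("DynamicPruningExpression" in plan
            or "LocalTableScan <empty>" in plan)
```

With a check like this, the degenerate plan shown above would pass the assertion instead of failing it, while a plan that has neither DPP nor an empty scan would still fail.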
it looks like the solution here is that we need update the test logic to be something like an either/or capture
I'm not sure that's the best fix. The point of this test is to check handling of DPP, and the problem here is that the datagen happened to produce inputs that failed to produce a plan requiring DPP. IMHO a better fix is to update the input data generation to ensure there isn't a degenerate join. If we want to test handling of degenerate joins as well, that should be a separate test that explicitly sets up inputs to produce a degenerate join.
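One way to guarantee a non-degenerate join is to draw the fact table's join keys only from keys that are known to exist in the dimension table. A rough sketch of the idea, assuming plain row lists (the function name and schema here are hypothetical; the real tests use the integration suite's datagen framework):

```python
import random

def gen_join_tables(seed, num_fact_rows=100, dim_keys=range(5)):
    """Generate (fact, dim) row lists whose join on the key column can
    never be empty: every fact key is drawn from dim_keys, and every
    dim key is present in the dimension table."""
    rng = random.Random(seed)
    dim = [(k, rng.randint(0, 1000)) for k in dim_keys]
    fact = [(rng.choice(list(dim_keys)), rng.randint(0, 1000))
            for _ in range(num_fact_rows)]
    return fact, dim
```

Because every fact key appears in the dimension table, AQE cannot prove the join empty for any seed, so the plan should keep the DPP filter rather than degenerating into a LocalTableScan.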
Makes sense. Will investigate what is producing the empty join
test_dpp_empty_relation already exists, so I think we just need to prevent the degenerate join in this test.
Test is now failing again:
FAILED ../../src/main/python/dpp_test.py::test_dpp_reuse_broadcast_exchange[true-5-parquet][DATAGEN_SEED=1707665221, INJECT_OOM, IGNORE_ORDER]
Saw this fail again on Dataproc nightly run.
[2024-02-22T15:41:19.602Z] FAILED ../../src/main/python/dpp_test.py::test_dpp_reuse_broadcast_exchange[true-5-parquet][DATAGEN_SEED=1708615902, IGNORE_ORDER] - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
[2024-02-22T15:41:19.602Z] FAILED ../../src/main/python/dpp_test.py::test_dpp_reuse_broadcast_exchange[true-5-orc][DATAGEN_SEED=1708615902, IGNORE_ORDER] - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
[2024-02-22T15:41:19.602Z] = 2 failed, 116 passed, 11 skipped, 26232 deselected, 9 warnings in 557.14s (0:09:17) =
Another failure
[2024-02-29T10:10:57.500Z] FAILED ../../src/main/python/dpp_test.py::test_dpp_reuse_broadcast_exchange[true-5-parquet][DATAGEN_SEED=1709192431, INJECT_OOM, IGNORE_ORDER] - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
[2024-02-29T10:10:57.500Z] FAILED ../../src/main/python/dpp_test.py::test_dpp_reuse_broadcast_exchange[true-5-orc][DATAGEN_SEED=1709192431, IGNORE_ORDER] - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
Considering this is actually a test issue (the test not being able to avoid an empty LocalTableScan) and not an issue with the plugin, lowering the priority
From a recent nightly test run: