andygrove opened this issue 5 days ago
The reason that each plan appears to be planned twice is that we split the ShuffleWriterExec
from the rest of the plan (as noted in https://github.com/apache/datafusion-comet/issues/977).
executePlan() stage 18 partition 28 of 29: planning took 666.045344ms
Comet native query plan:
SortExec: TopK(fetch=10), expr=[col_1@1 DESC NULLS LAST, col_2@2 ASC], preserve_partitioning=[false]
CopyExec [UnpackOrDeepCopy]
ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32]
executePlan() stage 18 partition 28 of 29: planning took 689.940558ms
Comet native query plan:
ShuffleWriterExec: partitioning=UnknownPartitioning(1)
ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32]
For comparison, physical planning time in Ballista for the same query (TPC-H q3) never takes more than 1ms, and overall execution time is ~6 seconds compared to ~20 seconds in Comet.
This is getting pretty interesting. I improved the native explain feature in https://github.com/apache/datafusion-comet/pull/1099 and we now see the planning time. The following trivial plan takes more than a second to plan.
24/11/19 11:25:26 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.304405241s):
ShuffleWriterExec: partitioning=UnknownPartitioning(1)
ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32]
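As a rough illustration of where such a timing could come from, the number in a log line like this can be produced by wrapping plan creation in a timer. This is a minimal sketch, not the actual jni_api.rs code from the PR; `time_planning` is a hypothetical helper, and only `displayable` is a real DataFusion API:

```rust
use std::sync::Arc;
use std::time::Instant;

use datafusion::error::Result;
use datafusion::physical_plan::{displayable, ExecutionPlan};
use log::info;

/// Hypothetical helper: time an arbitrary plan-building closure and log the
/// resulting plan in the same style as the log lines above.
fn time_planning<F>(build: F) -> Result<Arc<dyn ExecutionPlan>>
where
    F: FnOnce() -> Result<Arc<dyn ExecutionPlan>>,
{
    let start = Instant::now();
    let plan = build()?;
    info!(
        "Comet native query plan (planning took {:?}):\n{}",
        start.elapsed(),
        displayable(plan.as_ref()).indent(true)
    );
    Ok(plan)
}
```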
I also added a criterion benchmark to plan this query in the same PR, and it only takes 23 microseconds.
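A criterion benchmark for plan creation can be structured like the following sketch. The `plan_trivial_query` function is a hypothetical stand-in for deserializing the protobuf and building the trivial ShuffleWriterExec plan shown above; it is not the benchmark from the PR verbatim:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical stand-in for decoding the serialized plan and calling
// PhysicalPlanner::create_plan on the trivial ShuffleWriterExec plan above.
fn plan_trivial_query() {
    // ... Comet-specific plan-building code elided ...
}

fn bench_planning(c: &mut Criterion) {
    c.bench_function("create_plan_shuffle_writer", |b| b.iter(plan_trivial_query));
}

criterion_group!(benches, bench_planning);
criterion_main!(benches);
```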
Another observation is that the planning time increases with each successive invocation:
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.377855944s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.403194796s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.424167789s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.487451934s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.492228413s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.517392898s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.534502296s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.556645236s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.622664488s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.622664287s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.730488921s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.730601884s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.730662237s):
24/11/19 11:25:24 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 1.73067477s):
Something strange is happening here. Does this indicate some sort of contention or locking that is causing these long planning times?
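One way to investigate would be to split the single planning timer into phases, so that any contention shows up in a specific step rather than one aggregate number. This is a diagnostic sketch only; the three closures are placeholders for the actual decode, planner-construction, and plan-build steps in jni_api.rs:

```rust
use std::time::Instant;
use log::info;

// Diagnostic sketch (not existing Comet code): time each planning phase
// separately so a slowdown can be attributed to a specific step.
fn time_phases(
    decode: impl FnOnce(),        // placeholder: protobuf deserialization
    build_planner: impl FnOnce(), // placeholder: PhysicalPlanner construction
    build_plan: impl FnOnce(),    // placeholder: create_plan
) {
    let t0 = Instant::now();
    decode();
    let t1 = Instant::now();
    build_planner();
    let t2 = Instant::now();
    build_plan();
    let t3 = Instant::now();
    info!(
        "decode: {:?}, planner init: {:?}, plan build: {:?}",
        t1 - t0,
        t2 - t1,
        t3 - t2
    );
}
```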
There are other instances where the planning is much faster:
24/11/19 11:44:20 INFO core/src/execution/jni_api.rs: Comet native query plan (planning took 15.748199ms):
ShuffleWriterExec: partitioning=Hash([Column { name: "col_1", index: 1 }], 200)
ScanExec: source=[], schema=[col_0: Int64, col_1: Int64, col_2: Date32, col_3: Int32]
Plan creation time can take longer than actually executing the plan in some cases:
24/11/19 12:47:47 INFO core/src/execution/jni_api.rs: Comet native query plan with metrics (plan creation time: 1361ms):
ShuffleWriterExec: partitioning=UnknownPartitioning(1), metrics=[output_rows=10, elapsed_compute=28.875µs, spill_count=0, spilled_bytes=0, data_size=704]
ScanExec: source=[], schema=[col_0: Int64, col_1: Decimal128(34, 4), col_2: Date32, col_3: Int32], metrics=[output_rows=10, elapsed_compute=532ns, cast_time=1ns]
24/11/19 12:47:31 INFO core/src/execution/jni_api.rs: Comet native query plan with metrics (plan creation time: 18ms):
FilterExec: col_2@2 IS NOT NULL AND col_2@2 < 1995-03-15 AND col_1@1 IS NOT NULL AND col_0@0 IS NOT NULL, metrics=[output_rows=1529587, elapsed_compute=9.038835ms]
ScanExec: source=[CometScan parquet (unknown)], schema=[col_0: Int64, col_1: Int64, col_2: Date32, col_3: Int32], metrics=[output_rows=3145728, elapsed_compute=4.589971ms, cast_time=3.56982ms]
What is the problem the feature request solves?
For each query stage, the serialized query plan is sent to the executor with each task. Each task deserializes the protobuf, creates a `PhysicalPlanner`, and builds a native query plan. The query plan for each partition in a stage is essentially identical, except for the scan input JNI references, so we are duplicating this query planning work across each partition. In some cases, planning is very expensive: TPC-H q3 stage 18 seems to take around 90 seconds. Here is partial debug output. Note that each partition seems to create the query plan twice, which needs further investigation.
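In pseudocode, the per-task flow described above looks like the following sketch. `SparkPlan`, `NativePlan`, and the function bodies are stubs standing in for Comet internals; the point is that every task repeats the same decode-and-plan work:

```rust
// Stub types standing in for Comet internals (illustrative only).
struct SparkPlan;
struct NativePlan;
struct PhysicalPlanner;

impl PhysicalPlanner {
    fn new() -> Self {
        PhysicalPlanner
    }
    fn create_plan(&self, _plan: &SparkPlan) -> NativePlan {
        NativePlan
    }
}

fn decode_protobuf(_bytes: &[u8]) -> SparkPlan {
    SparkPlan
}

// Called once per task, i.e. once per partition in the stage.
fn on_task_start(serialized_plan: &[u8]) -> NativePlan {
    let spark_plan = decode_protobuf(serialized_plan); // same bytes for every task
    let planner = PhysicalPlanner::new();              // fresh planner per task
    planner.create_plan(&spark_plan)                   // near-identical plan per task
}
```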
Here is another example where planning is relatively cheap, but repeated many times, resulting in 1.76 seconds total planning time.
Questions:
I used the following code to pass the partition numbers to the native code:
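(The snippet itself was not captured in this issue text. As a stand-in, a hedged sketch of how a partition index can be accepted through a Comet-style JNI entry point is shown below; the function name follows the Native.executePlan convention, but the exact signature is an assumption, not the actual binding.)

```rust
use jni::objects::JClass;
use jni::sys::{jint, jlong};
use jni::JNIEnv;

// Hypothetical reconstruction: one way a partition index can be passed from
// the JVM into a native entry point. The signature is an assumption.
#[no_mangle]
pub extern "system" fn Java_org_apache_comet_Native_executePlan(
    _env: JNIEnv,
    _class: JClass,
    exec_context_handle: jlong, // handle returned from plan creation
    partition: jint,            // Spark partition index for this task
) -> jlong {
    // Illustrative body only: a real implementation would look up the
    // execution context and associate it with this partition.
    let _ = (exec_context_handle, partition);
    0
}
```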
Describe the potential solution
No response
Additional context
No response