NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Qualification tool should recommend partition numbers based on task runtime #955

Open parthosa opened 4 months ago

parthosa commented 4 months ago

If the task runtime of CPU jobs is low, the qualification tool should recommend reducing the number of partitions when migrating to GPU.

cc: @kuhushukla

tgravescs commented 4 months ago

Is there any other concrete information on what was being seen, or on what the request is from the tuning/GPU side? Was AQE on in this case, with coalescing?

parthosa commented 4 months ago

I think Kuhu would have a more concrete example.

Example

Let's take a snippet of stages from a sample application.

In the example below, the average task duration is tiny relative to the task count (the Task Duration Avg / Num Tasks column). In this case, would it be beneficial to reduce the number of partitions when migrating to GPU? A rough sketch of this kind of check follows the table.

| Stage Id | Num Tasks | Task Duration (Avg) (ms) | Task Duration Avg / Num Tasks | Stage Duration (ms) | Task Duration (Sum) (ms) |
|----------|-----------|--------------------------|-------------------------------|---------------------|--------------------------|
| 4        |     66358 |                    371.6 |                         0.006 |              412922 |                 24657490 |
| 43       |     12380 |                   9388.8 |                         0.758 |             2185707 |                116233725 |
| 59       |      2000 |                    778.3 |                         0.389 |               58791 |                  1556644 |
| 40       |      1916 |                   6389.1 |                         3.335 |              425866 |                 12241564 |
| 202      |      1600 |                      8.6 |                         0.005 |                 261 |                    13799 |
| 175      |      1600 |                      9.4 |                         0.006 |                 275 |                    15065 |
| 194      |      1600 |                     10.4 |                         0.007 |                 312 |                    16692 |
| 184      |      1600 |                     13.8 |                         0.009 |                 400 |                    22021 |
| 34       |      1600 |                     14.9 |                         0.009 |                3881 |                    23771 |
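
For illustration, here is a minimal sketch (not the qualification tool's actual code) of the kind of check being described: flag stages whose average task duration divided by task count falls below a threshold. The `StageMetrics` fields and the 0.01 threshold are assumptions chosen to match the table above.

```python
# Hypothetical heuristic: flag stages whose average task duration is tiny
# relative to their task count (the "Task Duration Avg / Num Tasks" column).
# Field names and the threshold are illustrative only.
from dataclasses import dataclass

@dataclass
class StageMetrics:
    stage_id: int
    num_tasks: int
    avg_task_duration_ms: float

def flag_over_partitioned(stages, ratio_threshold=0.01):
    """Return stage IDs where avg task duration / num tasks falls below the threshold."""
    flagged = []
    for s in stages:
        ratio = s.avg_task_duration_ms / s.num_tasks
        if ratio < ratio_threshold:
            flagged.append(s.stage_id)
    return flagged

# Example using a few rows from the table above.
stages = [
    StageMetrics(4, 66358, 371.6),
    StageMetrics(43, 12380, 9388.8),
    StageMetrics(202, 1600, 8.6),
]
print(flag_over_partitioned(stages))  # -> [4, 202]
```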


Spark AQE/partition configs:

- `spark.sql.adaptive.advisoryPartitionSizeInBytes` = `32m`
- `spark.sql.adaptive.autoBroadcastJoinThreshold` = `10485760`
- `spark.sql.adaptive.coalescePartitions.initialPartitionNum` = `800`
- `spark.sql.adaptive.coalescePartitions.minPartitionNum` = `320`
- `spark.sql.shuffle.partitions` = `400`

tgravescs commented 4 months ago

Task duration is probably not the only thing to look at, but it would be interesting to investigate some more. There are lots of factors that come into play, including the number of "slots" available to run tasks. There is an overhead to running extra tasks, but if you have slots to run everything in parallel, that still might be better than making each task, say, 2x larger (a rough slots/waves calculation is sketched below). In your example there are likely plenty of tasks in the 66358-task stage, but then again that one probably isn't controlled by the configs you pointed out. I would guess that is a parquet read or something, controlled by the max partition bytes to read.
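
As a back-of-the-envelope illustration of the slots point, the sketch below estimates how many scheduling "waves" a stage needs for a given cluster shape. The executor and core counts are hypothetical, not taken from the application above.

```python
# Hypothetical slots/waves estimate: if a stage's tasks already fit in a
# few waves, halving the partition count may save less than expected while
# making each task larger. The cluster shape below is made up.
import math

def task_waves(num_tasks: int, num_executors: int, cores_per_executor: int) -> int:
    slots = num_executors * cores_per_executor  # tasks that can run concurrently
    return math.ceil(num_tasks / slots)

# 20 executors x 16 cores = 320 slots
print(task_waves(1600, 20, 16))  # 5 waves with 1600 tasks
print(task_waves(800, 20, 16))   # 3 waves if the task count were halved
```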

It would be nice to have some real customer logs where we saw this; otherwise we can experiment some to see.

kuhushukla commented 4 months ago

@tgravescs yes, that is my recall. AQE was on.

tgravescs commented 3 months ago

OK, I guess unless we have a customer log or something showing a real scenario, AQE and coalescing partitions should handle most of this. If the issue is from a parquet read, that has to do with maxPartitionBytes, not the number of shuffle partitions (a simplified split-count estimate is sketched below).
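
For context on why a very large read stage is more likely governed by the read-side split size than by shuffle-partition settings, here is a deliberately simplified estimate of read-task count. It ignores `spark.sql.files.openCostInBytes` and the bytes-per-core adjustment Spark actually applies when sizing splits, and the input size used is a made-up figure.

```python
# Simplified read-split estimate for a file scan (ignores openCostInBytes
# and Spark's bytes-per-core floor); the 8 TiB input size is hypothetical.
import math

def approx_read_tasks(total_input_bytes: int,
                      max_partition_bytes: int = 128 * 1024 * 1024) -> int:
    return math.ceil(total_input_bytes / max_partition_bytes)

# ~8 TiB of input at the default 128m spark.sql.files.maxPartitionBytes
print(approx_read_tasks(8 * 1024**4))  # ~65536 read tasks
```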

One thing we could do is look at tuning those AQE settings, but we will have to test to see whether it helps.

Now, with that said, we could look at doing the inverse: recommending more partitions if we find a suitable heuristic, and letting AQE coalescing bring the count back down if necessary. This adds overhead as well, though, because the shuffle writer still writes more files up front. A sketch of what that could look like follows.
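
For completeness, here is a hypothetical sketch of that inverse recommendation: size the initial shuffle partition count from the cluster's task slots and rely on AQE coalescing to shrink it at runtime. The `waves_factor` sizing rule and the example cluster shape are assumptions, not an agreed-upon heuristic.

```python
# Hypothetical "inverse" recommendation: start with more shuffle partitions
# and let AQE coalescing reduce them at runtime. The slots * waves_factor
# sizing rule is illustrative only.
def recommend_partition_configs(num_executors: int,
                                cores_per_executor: int,
                                waves_factor: int = 3) -> dict:
    slots = num_executors * cores_per_executor
    initial = slots * waves_factor
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.initialPartitionNum": str(initial),
        "spark.sql.shuffle.partitions": str(initial),
    }

print(recommend_partition_configs(20, 16))  # initialPartitionNum = "960"
```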