databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

Analyze and improve Silver Job Runs performance (Spark 3.1.2) #1253

Open neilbest-db opened 1 week ago

neilbest-db commented 1 week ago

Introduction

A typical run of Overwatch module 2011 (JobsRuns_Silver, "JR") using release 0.8.1.2 to process 10 days of data in workspace e2-demo-west-ws produces this console output:

. . .
COMPLETED: 2011-Silver_JobsRuns --> Workspace ID: 2556758628403379
TIME RANGE: 2024-05-09 00:00:00 -> 2024-05-19 00:00:00
 SUCCESS! Silver_JobsRuns
OUTPUT ROWS: 210402
RUNTIME MINS: 16.25 --> Workspace ID: 2556758628403379
. . .

The corresponding run of the code from this feature branch (1228-silver-job-runs-spark312-r0812) in an otherwise identical environment against the same upstream data goes like this:

. . .
COMPLETED: 2011-Silver_JobsRuns --> Workspace ID: 2556758628403379
TIME RANGE: 2024-05-09 00:00:00 -> 2024-05-19 00:00:00
 SUCCESS! Silver_JobsRuns
OUTPUT ROWS: 210402
RUNTIME MINS: 5.25 --> Workspace ID: 2556758628403379
. . .

This ~67% reduction in runtime minutes is characteristic of the changes introduced here. Below are some details on how this was achieved, but first a little background.

Background

Databricks AQE shuffle auto-optimization

As part of Overwatch release 0.7.1.1, #675 introduced the following Spark configuration globally for all modules:

https://github.com/databrickslabs/overwatch/blob/ae56406ea5aef1e1e8a40b3653a9e6df70922458/src/main/scala/com/databricks/labs/overwatch/utils/SparkSessionWrapper.scala#L20

This is a component of a larger set of features called Adaptive Query Execution (AQE) that was released with Databricks Runtime (DBR) 7.0 and Spark 3.0.0. See this Databricks Engineering Blog article for more details on AQE in general.
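For reference, turning the feature on amounts to a single Spark session setting along the lines of the sketch below. This is a simplification for illustration, not the actual SparkSessionWrapper code at the permalink above.

```scala
// Minimal sketch: enabling Databricks AQE shuffle auto-optimization for a session.
// This flag is Databricks-specific and has no effect on open-source Spark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
```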

Soon thereafter performance degradation in certain Overwatch modules was reported in https://github.com/databrickslabs/overwatch/issues/794#issue-1607557972:

In an attempt to increase performance -- a regression was made that results in the larger modules (i.e. spark table merges) to have WAY too few tasks resulting in extremely long -- never finishing runtimes.

Solution -- disable this flag.

. . . and the autoOptimizeShuffle feature was disabled in #790 for release 0.7.1.2:

https://github.com/databrickslabs/overwatch/blob/f9c8dd088ea0e81d4a311368a5c47b1a4f9ad375/src/main/scala/com/databricks/labs/overwatch/utils/SparkSessionWrapper.scala#L20
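Conceptually the fix was the mirror image of the sketch above, forcing the flag off for every module (again a simplification of the linked line, not a verbatim copy):

```scala
// Minimal sketch: disabling the same Databricks AQE flag globally, as in 0.7.1.2.
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "false")
```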

JR shuffle factor

Prior to the AQE shuffle auto-optimization described in the previous section, Overwatch release 0.7.1.0 (#563, #661) added the shuffleFactor parameter to the definition of jobRunsModule:

https://github.com/databrickslabs/overwatch/blob/bc136594b1e9d08c77339d6a31f7c87bfe2e7202/src/main/scala/com/databricks/labs/overwatch/pipeline/Silver.scala#L246

Based on this shuffleFactor value of 12.0 and the job cluster configuration used for these tests, Module.optimizeShufflePartitions() currently sets spark.sql.shuffle.partitions to 9600:

https://github.com/databrickslabs/overwatch/blob/3f4cdca4d8438550b13c38fec43eea60bdbee877/src/main/scala/com/databricks/labs/overwatch/pipeline/Module.scala#L79-L102
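The mechanics are roughly as follows. The helper name and formula in this sketch are illustrative assumptions, not the actual Module.optimizeShufflePartitions code at the permalink above; they only show how a per-module shuffleFactor can scale the shuffle partition count with cluster size.

```scala
// Illustrative only: a per-module shuffleFactor scaling spark.sql.shuffle.partitions.
// The formula and names are assumptions; see Module.optimizeShufflePartitions for the real logic.
import org.apache.spark.sql.SparkSession

def optimizeShufflePartitionsSketch(spark: SparkSession, shuffleFactor: Double): Unit = {
  // Total task slots currently available on the cluster.
  val clusterCores = spark.sparkContext.defaultParallelism

  // Scale the baseline partition count by the module's shuffleFactor,
  // never dropping below Spark's default of 200.
  val target = math.max(200, (clusterCores * shuffleFactor).toInt)

  spark.conf.set("spark.sql.shuffle.partitions", target.toString)
}
```

Whatever the exact formula, the important point is that this statically computed value (9600 in these tests) governs the task counts observed in the results below whenever AQE shuffle auto-optimization is not allowed to override it.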

Results

With AQE shuffle auto-optimization disabled in 0.8.1.2, many Spark stages for JR are forced to run 9600 tasks or more. This results in extended stage durations (grey vertical lines mark minutes) and additional executors (i.e. cluster auto-scaling; blue vertical lines) in the following chart from the Spark UI:

[Spark UI screenshots for the 0.8.1.2 run]

This PR introduces the following change to the Silver JR module's configuration in commit https://github.com/databrickslabs/overwatch/commit/d751d5fc75c939892b73f877cb0e5542eb2cc030: https://github.com/databrickslabs/overwatch/blob/d751d5fc75c939892b73f877cb0e5542eb2cc030/src/main/scala/com/databricks/labs/overwatch/pipeline/Silver.scala#L271-L274
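The exact diff is at the permalink; the idea is roughly the sketch below, which relaxes the flag only around the JR transformations and restores the previous setting afterwards. The helper name here is hypothetical and does not correspond to Overwatch's actual API.

```scala
// Hypothetical illustration: re-enable shuffle auto-optimization only while the
// Silver JobRuns work runs, leaving the global default (disabled) in place otherwise.
import org.apache.spark.sql.SparkSession

def withAutoOptimizeShuffle[T](spark: SparkSession)(body: => T): T = {
  val key      = "spark.databricks.adaptive.autoOptimizeShuffle.enabled"
  val previous = spark.conf.getOption(key)
  spark.conf.set(key, "true")
  try body
  finally previous match {
    case Some(v) => spark.conf.set(key, v)   // restore whatever was set before
    case None    => spark.conf.unset(key)    // or clear it entirely
  }
}
```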

When 0.8.1.2 runs with the enhancements in this PR, not only are the corresponding Spark job durations shorter, but the recruitment of additional executors is also deferred to later phases of the module:

[Spark UI screenshots for the run with this PR's changes]

Note that the differences in Spark job labels between these screenshots are due to other provisional features, unrelated to Spark configuration or performance, that were included in the Overwatch snapshot JAR used for these test runs. Those special labels are aids to development, testing, and troubleshooting to be introduced in #1223. Despite other minor refactoring in the JR module logic, the overall contours of the progression of Spark jobs are recognizable because these runs were restricted to only this module for the same date range. For example, Spark job 64, running from 20:28 until 20:36 in the upper timeline, corresponds to jobRunsAppendMeta (label truncated) at 20:23. This phase in particular seems to be where the overall duration of the module has decreased most significantly.

Closer examination of the task statistics for corresponding jobs shows that many tasks created in the absence of shuffle auto-optimization are effectively doing nothing!

[Spark UI screenshot: task statistics without shuffle auto-optimization]

With shuffle auto-optimization relaxed, we see a much more favorable distribution of task durations and bytes processed:

[Spark UI screenshot: task statistics with shuffle auto-optimization relaxed]

This trend is similarly apparent in other phases of the JR module.

Conclusion

This PR has produced demonstrable performance improvements for the following dataframe transformations:

Next steps

Further gains in resource utilization and time efficiency may be possible in the subsequent phases of the JR module (2011):

neilbest-db commented 1 week ago

0820_release should be moved to the tip of main before completing this PR, so I am leaving it in draft status for now.

neilbest-db commented 1 week ago

BTW, there will be merge conflicts. I am perfectly willing to help with those. I made this mess! LMK.

neilbest-db commented 5 days ago

@sriram251-code, the code that changes the values of the Spark UI Job Group IDs has been commented out in this branch per your recommendation (see https://github.com/databrickslabs/overwatch/pull/1253/commits/9fe9f8cedfc1bd988d5c56f44f1ffb55b3536935). I would like to understand the scenarios when the Job Group IDs are the only place to extract certain tokens/IDs. Is it possible to enumerate these scenarios and map the flow of those tokens through the ETL to the target table(s)?
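For context, the Job Group ID is the label attached through Spark's standard SparkContext.setJobGroup API, along these lines (the group ID and description below are made up for illustration, not Overwatch's actual values):

```scala
// Standard Spark API for tagging jobs in the Spark UI; shown only to illustrate
// what "Job Group ID" refers to, not how Overwatch sets it.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc    = spark.sparkContext

sc.setJobGroup("silver_jobruns_2011", "Silver JobRuns: append meta", interruptOnCancel = false)
spark.range(1000).count()  // this job appears in the UI under the group above
sc.clearJobGroup()
```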

sonarcloud[bot] commented 5 days ago

Quality Gate passed

Issues
10 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

neilbest-db commented 5 days ago

closes #1228

neilbest-db commented 3 days ago

please do not merge this into 0820_release until #1223 is closed per this comment there.