databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

adjust Silver Job Runs module configuration #1256

Open neilbest-db opened 2 days ago

neilbest-db commented 2 days ago

enable auto-optimized shuffle for module 2011

Originally implemented for Spark 3.1.2 in commit https://github.com/databrickslabs/overwatch/commit/d751d5fc75c939892b73f877cb0e5542eb2cc030 on branch 1228-silver-job-runs-spark312-r0812 as part of #1253.

This PR removes all of the new utilities and transformation refactoring that were only aids to development and testing. They did not impact performance in any significant way.

The essential change brought to this branch (1228-optimization-only) is entirely expressed in commit https://github.com/databrickslabs/overwatch/commit/8c9ee79d20a4904ecd5aa2908715179c58e615e1. The new code introduced is here: https://github.com/databrickslabs/overwatch/blob/8c9ee79d20a4904ecd5aa2908715179c58e615e1/src/main/scala/com/databricks/labs/overwatch/pipeline/Silver.scala#L271-L274
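For readers unfamiliar with per-module Spark settings, here is a minimal sketch of the general idea of scoping a configuration override to a single unit of work and restoring the previous value afterwards. It is not the code from the commit linked above. The helper `spark_overrides` and the stand-in `DictConf` are hypothetical names introduced for illustration. The key `spark.sql.shuffle.partitions` with value `auto` is an assumption about how auto-optimized shuffle might be switched on in a Databricks runtime; the setting this PR actually uses is in the linked lines of `Silver.scala`.

```python
from contextlib import contextmanager


class DictConf:
    """Hypothetical dict-backed stand-in exposing get/set/unset like a Spark runtime config."""

    def __init__(self, initial=None):
        self._values = dict(initial or {})

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def spark_overrides(conf, overrides):
    """Temporarily apply `overrides` to `conf`; restore prior values on exit."""
    # Remember what each key held before (None means "was not set").
    previous = {key: conf.get(key, None) for key in overrides}
    try:
        for key, value in overrides.items():
            conf.set(key, value)
        yield conf
    finally:
        for key, old in previous.items():
            if old is None:
                conf.unset(key)
            else:
                conf.set(key, old)


# Assumed key/value for illustration only; see the lead-in.
SHUFFLE_KEY = "spark.sql.shuffle.partitions"

conf = DictConf({SHUFFLE_KEY: "200"})
with spark_overrides(conf, {SHUFFLE_KEY: "auto"}):
    inside = conf.get(SHUFFLE_KEY)   # override is active only inside the block
after = conf.get(SHUFFLE_KEY)        # previous value is back in place
print(inside, after)
```

In a notebook, the same context manager could in principle be handed `spark.conf` instead of `DictConf`, since it only relies on `get`, `set`, and `unset`.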

The background and analysis of the optimization presented in the description of #1253 are still representative of the performance improvements realized by this change.

proof notebook (IN PROGRESS)

Corresponding job runs for before/after comparison of this change:

| 0.8.1.2 | 0.8.2.0-SNAPSHOT |
| --- | --- |
| Run 647559332994892 (210402 rows in 19.2 mins) | Run 455980391763738 (210402 rows in 8.13 mins) |
| Run 265434635290698 (483230 rows in 20.53 mins) | Run 635176874378143 (483230 rows in 8.9 mins) |

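To make the before/after comparison easier to read, the short sketch below derives the speedup and throughput implied by the row counts and timings in the table. The `RunPair` class and its field names are hypothetical; only the numbers come from the table. Both pairs work out to a speedup of roughly 2.3x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPair:
    """One before/after comparison; names are illustrative, values are from the table."""
    rows: int
    before_mins: float  # 0.8.1.2
    after_mins: float   # 0.8.2.0-SNAPSHOT

    @property
    def speedup(self) -> float:
        # How many times faster the newer run finished.
        return self.before_mins / self.after_mins

    def rows_per_min(self, minutes: float) -> float:
        return self.rows / minutes


PAIRS = [
    RunPair(rows=210402, before_mins=19.2, after_mins=8.13),
    RunPair(rows=483230, before_mins=20.53, after_mins=8.9),
]

for pair in PAIRS:
    print(
        f"{pair.rows} rows: "
        f"{pair.rows_per_min(pair.before_mins):.0f} -> "
        f"{pair.rows_per_min(pair.after_mins):.0f} rows/min, "
        f"speedup {pair.speedup:.2f}x"
    )
```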
sonarcloud[bot] commented 2 days ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

neilbest-db commented 2 days ago

Added row counts and timings for the second set of comparison runs to the table in the description. ☝️