We found the following notes on external optimization here:
To externalize the optimize and z-order tasks:
- Set externalizeOptimize to true in the runner notebook config (All Workspaces)
  - If running as a job, don’t forget to update the job parameters with the new escaped JSON config string
- Create a job to execute the optimize on the workspace you wish (One Workspace - One Job)
  - Type: JAR
  - Main Class: com.databricks.labs.overwatch.Optimizer
  - Dependent Libraries: com.databricks.labs:overwatch_2.12:<LATEST>
  - Parameters: ["<Overwatch_etl_database_name>"]
  - Cluster Config: Recommended
    - DBR Version: 11.3 LTS
    - Photon: Optional
    - Enable Autoscaling Compute with enough nodes to go as fast as you want. 4-8 is a good starting point
    - AWS: Enable autoscaling local storage
    - Worker Nodes: Storage Optimized
    - Driver Node: General Purpose with 16+ cores
      - The driver gets very busy on these workloads; recommend 16-64 cores depending on the max size of the cluster
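For reference, the job settings in these notes could be expressed as a Jobs API style payload, roughly as sketched here. This is only an illustration: the job name, the database name, the node types (i3.xlarge for workers, m5.4xlarge for the driver), and reading "4-8" as an autoscale range are assumptions, not values from the notes, and <LATEST> is kept as the placeholder it is.

```python
import json

# Hypothetical values; replace with your own. Neither comes from the notes.
ETL_DB_NAME = "overwatch_etl"   # assumed name of the Overwatch ETL database
OVERWATCH_VERSION = "<LATEST>"  # placeholder kept exactly as in the notes


def build_optimizer_job(etl_db: str, version: str) -> dict:
    """Build a Jobs API 2.1 style payload for the external optimizer job."""
    return {
        "name": "overwatch-optimizer",  # hypothetical job name
        "tasks": [
            {
                "task_key": "optimize",
                # Type: JAR, with the main class and parameter from the notes
                "spark_jar_task": {
                    "main_class_name": "com.databricks.labs.overwatch.Optimizer",
                    "parameters": [etl_db],
                },
                "libraries": [
                    {
                        "maven": {
                            "coordinates": f"com.databricks.labs:overwatch_2.12:{version}"
                        }
                    }
                ],
                "new_cluster": {
                    "spark_version": "11.3.x-scala2.12",  # DBR 11.3 LTS
                    "node_type_id": "i3.xlarge",          # assumed storage-optimized worker
                    "driver_node_type_id": "m5.4xlarge",  # assumed 16-core general-purpose driver
                    "autoscale": {"min_workers": 4, "max_workers": 8},  # assumed range
                    "enable_elastic_disk": True,          # autoscaling local storage (AWS)
                },
            }
        ],
    }


job = build_optimizer_job(ETL_DB_NAME, OVERWATCH_VERSION)
print(json.dumps(job, indent=2))
```

The printed JSON is the kind of body one would submit when creating the job through the API or paste into the job's JSON editor; the same fields map onto the UI settings listed in the notes.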
I have the following questions:
1. When it says "Set externalizeOptimize to true in the runner notebook config (All Workspaces)", does that mean adding a column named externalizeOptimize to the config table and setting it to boolean true for all workspaces?
2. Can you clarify this part: "If running as a job don’t forget to update the job parameters with the new escaped JSON config string"? We will be running this as a job. What is this new escaped JSON config string? Is there a specific parameter for it?
3. When it says "Create a job to execute the optimize on the workspace you wish (One Workspace - One Job)", does it mean we need to create one optimization job and run it separately on each workspace? Or does it mean a single job can be created and run on any one workspace of our choice?
There is no need to add any "externalizeOptimize" column in the config table.
When configuring the optimizer job, make sure to pass your ETL database name as the job parameter, e.g. ["etl_database_name"].
In a multi-workspace deployment, one optimizer job is sufficient; there's no need to run it on every workspace. In a single-workspace deployment, however, the optimizer job must be run in each workspace.
Ex: single-workspace deployment: one optimizer job per workspace
multi-workspace deployment: one optimizer job in total
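On the "escaped JSON config string" from the notes: that wording appears to refer to the older runner-notebook style of deployment, where the whole config is passed to the job as a single JSON string argument, so the inner double quotes have to be backslash-escaped. A minimal sketch of what such escaping looks like, assuming that reading is right; only externalizeOptimize comes from the notes, and someOtherSetting is a made-up stand-in for the rest of the config.

```python
import json

# Hypothetical, heavily trimmed config. Only externalizeOptimize is taken
# from the notes; someOtherSetting is a placeholder for everything else.
config = {
    "externalizeOptimize": True,
    "someOtherSetting": "value",
}

# Serialize the config to compact JSON.
compact = json.dumps(config, separators=(",", ":"))

# Encoding the JSON text itself as a JSON string escapes the inner quotes;
# slicing off the first and last character drops the outer quotes.
escaped = json.dumps(compact)[1:-1]

print(compact)
print(escaped)
```

The second printed line is the form that could be placed inside a quoted job parameter. If the deployment is driven by a config table rather than a runner notebook, this step may not apply at all.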