databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

External Optimize & Z-Order #1200

Closed mikekimlovelytics closed 6 months ago

mikekimlovelytics commented 6 months ago

We found the following notes on external optimization here:

To externalize the optimize and z-order tasks:

1. Set `externalizeOptimize` to `true` in the runner notebook config (All Workspaces)
   - If running as a job, don't forget to update the job parameters with the new escaped JSON config string
2. Create a job to execute the optimize on the workspace you wish (One Workspace - One Job)
   - Type: JAR
   - Main Class: `com.databricks.labs.overwatch.Optimizer`
   - Dependent Libraries: `com.databricks.labs:overwatch_2.12:<LATEST>`
   - Parameters: `["<Overwatch_etl_database_name>"]`
3. Use the recommended cluster config:
   - DBR Version: 11.3 LTS
   - Photon: Optional
   - Enable autoscaling compute with enough nodes to go as fast as you want; 4-8 is a good starting point
   - AWS: enable autoscaling local storage
   - Worker Nodes: Storage Optimized
   - Driver Node: General Purpose with 16+ cores (the driver gets very busy on these workloads; 16-64 cores are recommended depending on the max size of the cluster)
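The steps above can be sketched as a Jobs API payload. This is a minimal illustration, not the official setup: the job name, node types, and autoscale bounds are assumptions; only the main class, Maven coordinates, parameter shape, and DBR version come from the notes above, and `<LATEST>` must still be replaced with a concrete Overwatch release.

```python
import json

etl_database = "overwatch_etl"  # assumption: your Overwatch ETL database name

# Hypothetical Databricks Jobs API 2.1 payload for the external optimizer job.
job_spec = {
    "name": "overwatch-optimizer",  # illustrative name
    "tasks": [
        {
            "task_key": "optimize",
            "spark_jar_task": {
                "main_class_name": "com.databricks.labs.overwatch.Optimizer",
                "parameters": [etl_database],
            },
            "libraries": [
                # Replace <LATEST> with a concrete Overwatch release version
                {"maven": {"coordinates": "com.databricks.labs:overwatch_2.12:<LATEST>"}},
            ],
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",  # DBR 11.3 LTS
                "autoscale": {"min_workers": 4, "max_workers": 8},  # 4-8 is a good start
                "node_type_id": "i3.2xlarge",         # assumption: storage-optimized workers
                "driver_node_type_id": "m5.4xlarge",  # assumption: 16+ core general-purpose driver
                "enable_elastic_disk": True,          # AWS: autoscaling local storage
            },
        }
    ],
}

print(json.dumps(job_spec, indent=2))
```

The payload could then be submitted with `databricks jobs create --json-file` or a direct POST to the Jobs API.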

I have the following questions:

  1. When it says to set `externalizeOptimize` to `true` in the runner notebook config (All Workspaces), does it mean adding a column named `externalizeOptimize` to the config table and setting it to boolean `true` for all workspaces?
  2. Can you clarify the section "If running as a job don’t forget to update the job parameters with the new escaped JSON config string"? We will be running this as a job. What is this new escaped JSON config string? Is there a specific parameter for it?
  3. When it says Create a job to execute the optimize on the workspace you wish (One Workspace - One Job), does it mean we need to have one optimization job created and run separately on each workspace? Or does it mean just one job can be created and run on any workspace of our choice?
Mahalakshmi1305 commented 6 months ago

Hi @mikekimlovelytics

  1. There is no need to add any "externalizeOptimize" column in the config table.
  2. While configuring the optimizer job, make sure to pass your ETL database name as the job parameter, i.e. `["<etl_database_name>"]`
  3. In a multi-workspace deployment, one optimizer job is sufficient; there's no need to run it on all workspaces. However, in a single-workspace deployment, we need to run the optimizer job in each workspace. For example: single workspace, one optimizer job; multi-workspace, one optimizer job.
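On question 2, a short sketch of what an "escaped JSON config string" typically means may help: the runner config is serialized to JSON, and when that JSON string is itself embedded in the job's parameter JSON, its quotes get backslash-escaped. The config keys below other than `externalizeOptimize` are placeholders, and the `overwatchArgs` parameter name is hypothetical; use whatever parameter your runner job actually reads.

```python
import json

# Illustrative fragment of a runner config; only "externalizeOptimize"
# comes from the docs quoted above -- the other keys are placeholders.
config = {
    "storagePrefix": "/mnt/overwatch",
    "etlDatabase": "overwatch_etl",
    "externalizeOptimize": True,
}

# Serializing produces the JSON config string ...
config_str = json.dumps(config)

# ... and embedding that string inside the job parameters is what forces
# the quotes to be escaped (\" in the resulting payload).
job_params = json.dumps({"overwatchArgs": config_str})  # "overwatchArgs" is a hypothetical name
print(job_params)
```

The inner string round-trips cleanly: parsing the parameter value with `json.loads` recovers the original config, including `externalizeOptimize: true`.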

Thanks Mahalakshmi