databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

Reconciling cost metrics with account usage #145

Closed j0nnyr0berts closed 3 years ago

j0nnyr0berts commented 3 years ago

Hi, I'd like to use Overwatch to monitor Databricks spend during the day. I've successfully run the modules required to generate the clusterStateFact table; however, when I compare the aggregated costs against the Account Usage page, they have a similar shape but differ considerably in $ value.

I am matching the interactiveDBUPrice / automatedDBUPrice compute cost variables in Overwatch with the All-Purpose Compute and Jobs Compute inputs in the pricing options of the Account Usage page.

I have tried comparing the output of Overwatch and Account Usage for the same job run ID, but am unable to reconcile any of the Overwatch columns with the dbus or machineHours columns in the account output.

It would be great to know if it's possible to generate the same output as the account usage cost estimation using overwatch so that I may get an hourly view. Failing that, it would be great to understand any differences between the two outputs!

select
  date(timestamp_state_start) as day,
  sum(total_compute_cost),
  sum(total_DBU_cost),
  sum(total_cost)
from overwatch.clusterstatefact
where databricks_billable = true
group by day

For context, I would naively expect the daily sum of total_cost in overwatch to be close to the estimated daily $ I see in the Accounts Usage page. However, the Overwatch value is much higher - often over 2x.

GeekSheikh commented 3 years ago

total_cost includes both DBU and compute costs; is that what you're trying to compare? Look at the data model: total_cost == compute cost + DBU cost. Also, make sure your configured costs reflect reality; review the custom costing section in the docs.

Overwatch costs are not exact; they are estimates. They should track closely but are unlikely to "match" exactly, as they are derived rather than looked up from true cost tables.
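A minimal sketch of why the two numbers can diverge by roughly 2x, using made-up figures (the values and the assumption that the Account Usage estimate covers DBU spend only are illustrative, not taken from real Overwatch output):

```python
# Hypothetical daily figures -- for illustration only, not real data.
dbu_cost = 100.0      # estimated DBU spend for the day (what the usage page estimates)
compute_cost = 110.0  # estimated cloud VM spend for the day (Overwatch adds this)

# Per the data model, Overwatch's total_cost is the sum of both components:
total_cost = dbu_cost + compute_cost

# Comparing total_cost against a DBU-only estimate overstates by the compute share:
ratio = total_cost / dbu_cost
print(f"total_cost = {total_cost:.2f}, ratio vs DBU-only = {ratio:.2f}x")
```

With compute roughly on par with DBU spend, the ratio lands near 2x, which matches the gap described above.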

j0nnyr0berts commented 3 years ago

Thanks! Costs aside, do you know if there's any way to reconcile compute hours between the accounts usage stats and Overwatch? I've attached the output for a specific job from both. Trying to understand what 'machineHours' in the accounts page relates to (if anything!) in Overwatch.

account_usage_job_70245.csv overwatch_job_70245.csv

Also, I notice that in Overwatch runs for the current day, automated externally triggered jobs are missing. Is this due to the fact that audit logs are only delivered daily?

rishansanjayDB commented 3 years ago

What cloud platform do you use?

j0nnyr0berts commented 3 years ago

AWS

rishansanjayDB commented 3 years ago

You're correct, logs are delivered once a day. As for the machine hours: Overwatch provides core_hours. You would need to compute core_hours / number of cores on the machine, which gives aggregate machine hours for the cluster across all nodes. All that info is available in the clusterStateFact table.
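A quick sketch of that conversion; the column names follow clusterStateFact, but the values are hypothetical:

```python
# Hypothetical values for one cluster state -- illustrative only.
core_hours = 64.0    # total core-hours reported by Overwatch for the state
cores_per_node = 8   # cores on the node's machine type (e.g. an 8-core instance)

# Aggregate machine hours across all nodes, as described above:
machine_hours = core_hours / cores_per_node
print(machine_hours)  # -> 8.0
```

Summing this per-state figure over a job's cluster states should be the number to line up against machineHours in the account output.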

j0nnyr0berts commented 3 years ago

Thank you!