databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

Validate that all billable cluster states have `total_cost` > 0 and not null #1156

Closed gueniai closed 5 months ago

souravbaner-da commented 8 months ago

@gueniai Where cluster_ids have databricks_billable set to true and total_cost is either NULL or 0, the clusters are predominantly Serverless SQL Warehouses. The Overwatch (OW) framework does not currently support these cluster types, so this result is a deliberate and expected outcome of the validation framework.

The validation framework has been modified to apply this rule more precisely. The call now reads:

        (validationStatus, quarantineStatus) == validateRuleAndUpdateStatus(
          RuleSet(clsf_df.where("cluster_name is not null").where("databricks_billable is true")).add(validateRules(4)),
          tableName, key, validationStatus, quarantineStatus, "validate_greater_than_zero", Overwatch_RunID)

With this change, the validation rule is applied to every cluster_id currently within the scope of the Overwatch (OW) framework. The filter keeps only rows where cluster_name is not null and databricks_billable is true, and passes them into a RuleSet.
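To illustrate the scoping described above, here is a minimal sketch in plain Scala with no Spark dependency. It is not the framework's implementation: `ClusterState`, `BillableCostCheck`, `inScope`, `violations`, and the sample rows are hypothetical names invented for this example, and the row without a cluster_name only models an out-of-scope record (whether Serverless SQL Warehouse rows actually look like that is an assumption).

```scala
// Hypothetical stand-in for one row of the table being validated.
// Field names mirror the columns mentioned in this comment.
final case class ClusterState(
  clusterId: String,
  clusterName: Option[String],  // None models a NULL cluster_name
  databricksBillable: Boolean,
  totalCost: Option[Double]     // None models a NULL total_cost
)

object BillableCostCheck {
  // Mirrors the two where(...) clauses: keep named, billable rows only.
  def inScope(rows: Seq[ClusterState]): Seq[ClusterState] =
    rows.filter(r => r.clusterName.isDefined && r.databricksBillable)

  // An in-scope row fails when total_cost is NULL or not greater than zero.
  def violations(rows: Seq[ClusterState]): Seq[ClusterState] =
    inScope(rows).filterNot(_.totalCost.exists(_ > 0.0))

  // Made-up sample data, for illustration only.
  val sample: Seq[ClusterState] = Seq(
    ClusterState("c1", Some("etl"), databricksBillable = true, Some(12.5)),
    ClusterState("c2", Some("adhoc"), databricksBillable = true, Some(0.0)),
    ClusterState("c3", Some("ml"), databricksBillable = true, None),
    ClusterState("w1", None, databricksBillable = true, None),          // out of scope: no cluster_name
    ClusterState("c4", Some("dev"), databricksBillable = false, None)   // out of scope: not billable
  )

  def main(args: Array[String]): Unit =
    violations(sample).foreach(r => println(s"failed: ${r.clusterId}"))
}
```

In this sketch only c2 and c3 are reported. w1 and c4 are never evaluated because the filter removes them before the cost check runs, which is the effect the narrowed RuleSet input is meant to have.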

The commit for this change in the validation framework PR is: https://github.com/databrickslabs/overwatch/pull/1071/commits/e4ebf66d1e62e18d9c2fe88479dee07f5623f005

As a result, the rule now assesses all relevant cluster_ids within the Overwatch framework's scope, with the aim of improving accuracy and operational efficiency.

souravbaner-da commented 6 months ago

@gueniai After running the validation framework on the 0811 deployments, the issue appears to be fixed.