[ISSUE] Issues with `Quality Monitor` resource

willcorb commented 3 months ago

Configuration

resource "databricks_quality_monitor" "testTimeseriesMonitor" {
  table_name         = "${databricks_catalog.sandbox.name}.${databricks_schema.things.name}.${databricks_sql_table.myTestTable.name}"
  assets_dir         = "/Shared/provider-test/databricks_quality_monitoring/${databricks_sql_table.myTestTable.name}"
  output_schema_name = "${databricks_catalog.sandbox.name}.${databricks_schema.things.name}"
  time_series {
    granularities = ["1 hour"]
    timestamp_col = "timestamp"
  }
}

Expected Behavior

The example above is from the documentation. It makes no difference if it's our code of that in the example. We would expect the terraform to create the two metric tables and dashboard before the 'terraform' apply comes back as completed. We would also expect the terraform to destroy all the resources including the tables and dashboard after a destroy is requested.

We would also expect all the parameters to be documented in the documentation for the latest provider (1.49.1 at this time)

Actual Behavior

the 'terraform apply' confirms success for the monitor but it does not wait for the dashboard or 2 metric tables to be created. This means we cannot use a 'depends_on' to apply permissions to the two tables (since they are deployed through a SP token).

It takes anywhere between 8 to 20 minutes for the tables to be created (after the terraform reports success on apply).

'terraform destroy' does not remove the metric tables or dashboard only the monitor so these resources are orphaned. So we have to manually remove these assets and cannot automate properly.

When trying to create a inference_log quality monitor we get: 'Error: cannot create quality monitor: The analysis_config field must be provided.' analysis_config is not documented so we don't know what it references?

Steps to Reproduce

mgyucht commented 3 months ago

Hi @willcorb, thank you for filing this issue.

I'm following up with the backend team responsible for this service. The Terraform provider can be extended to wait for these tables to be created on quality monitor creation and to delete these tables on quality monitor deletion. However, I want to confirm whether this should be done directly in creation/deletion via the API. Otherwise, any API users who try to create/delete quality monitors will have the same issues.

That error message is also not great. analysis_config is not part of the REST API directly. I can see that this corresponds to one of inference_log, time_series, generic, or rag_studio. Can you share the template you're using to create the inference_log quality monitor? I will also ask the API team to return a more usable error message.

willcorb commented 3 months ago

Hi @mgyucht at the moment I've implemented a variable sleep variable to the terraform which can allow time for the tables to be created and then apply grants, although this is not idea because we are guessing how long it will take.

I've also implemented destroys for the two tables and dashboard using local-exec api calls. This appears to work, but again is not ideal. The monitor should really destroy the child artefacts.

Here is the terraform we use for the quality monitor, where we allow the type to be selected and populated accordingly.

resource "databricks_quality_monitor" "quality_monitor" {
  provider           = databricks.workspace
  table_name         = "${var.catalog_name}.${var.schema_name}.${var.table_name}"
  assets_dir         = "/Shared/${var.domain}/${var.schema_name}/${var.assets_dir}"
  output_schema_name = "${var.output_catalog}.${var.schema_name}"
  warehouse_id       = data.databricks_sql_warehouse.sql_warehouse_type.id
  schedule {
    quartz_cron_expression = var.cron_expression
    timezone_id            = "UTC"
  }
  dynamic "time_series" {
    for_each = var.monitor_type == "time_series" ? [1] : []
    content {
      granularities = var.time_series_config.granularities
      timestamp_col = var.time_series_config.timestamp_col
    }
  }

  dynamic "inference_log" {
    for_each = var.monitor_type == "inference_log" ? [1] : []
    content {
      granularities  = local.inference_log_config.granularities
      timestamp_col  = local.inference_log_config.timestamp_col
      model_id_col   = local.inference_log_base_config.model_id_col
      prediction_col = local.inference_log_base_config.prediction_col
      problem_type   = local.inference_log_base_config.problem_type
      # Optional fields
      label_col            = lookup(local.inference_log_optional_config, "label_col", null)
      prediction_proba_col = lookup(local.inference_log_optional_config, "prediction_proba_col", null)
    }
  }

  dynamic "snapshot" {
    for_each = var.monitor_type == "snapshot" ? [1] : []
    content {}
  }
}

resource "databricks_permissions" "quality_monitor_dashboard_usage" {
  provider     = databricks.workspace
  count        = length(var.read_groups) == 0 ? 0 : 1
  dashboard_id = databricks_quality_monitor.quality_monitor.dashboard_id
  depends_on   = [databricks_quality_monitor.quality_monitor]

  dynamic "access_control" {
    for_each = toset(var.read_groups)
    content {
      group_name       = access_control.value
      permission_level = "CAN_RUN"
    }
  }
}

adamwrobel-ext-gd commented 1 month ago

I am facing the same problem just calling update() on the Databricks Python SDK for a quality monitor. This happens when I try to update the schedule field.

db-wenxin commented 1 month ago

@adamwrobel-ext-gd Can you try something like this ( load additional parameters from get() output):

`monitor = client.quality_monitors.get(table_name=table_name)

client.quality_monitors.update( table_name=table_name, output_schema_name=monitor.output_schema_name, snapshot=monitor.snapshot, time_series=monitor.time_series, data_classification_config=monitor.data_classification_config, schedule=None # Set schedule to None to stop the table monitor )`

databricks / terraform-provider-databricks