
[ISSUE] Issue with `databricks_cluster` resource #3985

Open bgrams opened 2 months ago

bgrams commented 2 months ago

Provider does not consider policy-provided configuration when a databricks_cluster is configured with apply_policy_default_values = true. This results in 2 separate issues:

  1. Permanent resource drift
  2. Incorrect validation failures when relevant attributes are not provided directly in the resource configuration. For example, if the policy provides autoscaling or num_workers defaults, as is the case with all Databricks-provided policy families, cluster validation fails because it treats these attributes as containing their zero values when they are unset in the HCL.

Configuration

data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster_policy" "policy" {
  name = "test-policy"

  definition = jsonencode({
    "spark_conf.spark.some.property" = {
      type         = "unlimited"
      isOptional   = true
      defaultValue = "value"
    }
    "num_workers" = {
      type  = "fixed"
      value = 1
    }
  })
}

resource "databricks_cluster" "this_will_fail_on_apply" {
  cluster_name            = "test"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  policy_id               = databricks_cluster_policy.policy.id

  apply_policy_default_values = true
}

resource "databricks_cluster" "this_will_drift_forever" {
  cluster_name            = "test"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  num_workers             = 1  // this is redundant with the policy configuration, but required to pass validation
  autotermination_minutes = 20
  policy_id               = databricks_cluster_policy.policy.id

  apply_policy_default_values = true

  spark_conf = {
   "spark.some.other.property" = "value"
  }
}

Expected Behavior

Provider respects policy-provided attributes for databricks_cluster resource configuration.

Actual Behavior

Provider does not respect policy-provided attributes for databricks_cluster resource configuration.

Steps to Reproduce

  1. terraform apply -target=databricks_cluster.this_will_fail_on_apply - will fail
  2. terraform apply -target=databricks_cluster.this_will_drift_forever
  3. terraform plan -target=databricks_cluster.this_will_drift_forever - will produce a diff

Terraform and provider versions

Terraform 1.5.6 Provider 1.51.0

Is it a regression?

No

Debug Output

N/A

Important Factoids

No

Would you like to implement a fix?

Possibly

alexott commented 2 months ago

@bgrams the 2nd item is explicitly documented: https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/cluster#policy_id

bgrams commented 2 months ago

Thanks for sharing this, @alexott. Perhaps I could have filed that one as an enhancement rather than an issue - both are included here since they are likely addressed by the same fix. If you'd prefer that I create a separate issue to track them independently, I can do that.

Re. drift - is this an explicit design decision or do we agree that this is something that could/should be improved? Consider the following use case for cluster policies from the Databricks documentation:

Simplify the user interface and enable more users to create their own clusters

The API does not require that policy configuration defaults also be provided in the request, and having this requirement for the provider alone is a complicating factor for the end user. Therefore, I imagine this would be a useful and reasonable improvement, and I don't believe the implementation would be particularly challenging. However, I acknowledge that there is likely some relevant historical context that I am missing.

What do you think?

alexott commented 2 months ago

Terraform itself expects fields to be set explicitly - if a new value comes from the backend, then the field should be marked as computed, otherwise TF will fail with an error. But marking all fields as computed isn't feasible and would lead to errors, like not being able to remove the driver node type, etc.
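To illustrate with a minimal, hypothetical schema excerpt (not the provider's actual cluster schema):

import "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

// Hypothetical excerpt: once a field is Optional + Computed, values coming from
// the backend no longer produce a diff, but Terraform can't tell "unset by the
// user" apart from "computed by the backend", so the value can't be removed
// through configuration alone.
var clusterSchemaSketch = map[string]*schema.Schema{
    "driver_node_type_id": {
        Type:     schema.TypeString,
        Optional: true,
        Computed: true, // suppresses drift, but the override can never be unset
    },
    "num_workers": {
        Type:     schema.TypeInt,
        Optional: true, // not Computed: backend/policy-provided values show up as drift
    },
}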

I also just tried with the latest version from the repo, and I don't see any drift with apply_policy_default_values = true as soon as I provide all necessary details.

bgrams commented 2 months ago

Default values don't necessarily need to be computed from the backend - we know the policy at the time of cluster creation and can predict the API behavior.

as soon as I provide all necessary details.

To clarify, this is the complicating factor that I think could be improved. We have many TF-managed clusters across dozens of repos. Keeping all of these up to date whenever there is a policy change is tedious and shouldn't be necessary as it is not required by the Clusters API.

A solution may be to snapshot the policy, store it in state, and then use it within a CustomizeDiffFunc to suppress an expected diff. I'll see if I can come up with a working example to share, but I'm happy to entertain other ideas as well.

bgrams commented 2 months ago

Here's an example function that correctly suppresses drift when added as a schema.DiffSuppressFunc:

import (
    "encoding/json"
    "fmt"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func appliedPolicyDiffSuppressFunc(k, old, new string, d *schema.ResourceData) bool {

    // When a policy_id is provided during cluster creation, we snapshot the policy and
    // merge the definition with policy_family_definition_overrides. This is then serialized
    // to JSON and the merged result is persisted at this address.
    policyString, ok := d.GetOk("applied_policy")
    if !ok {
        return false
    }

    var appliedPolicy map[string]map[string]interface{}
    err := json.Unmarshal([]byte(policyString.(string)), &appliedPolicy)
    if err != nil {
        return false
    }

    // Policy definition and resource attributes are keyed identically.
    // This couldn't possibly be more convenient.
    policy, ok := appliedPolicy[k]
    if !ok {
        return false
    }

    // Fixed attributes will always be applied by the backend.
    // Policy values may be numbers or booleans, so compare their string forms.
    if policy["type"] == "fixed" {
        return new == "" && old == fmt.Sprintf("%v", policy["value"])
    }

    // Default attributes should only be considered when apply_policy_default_values is true
    applyDefaults, ok := d.GetOk("apply_policy_default_values")
    if ok && applyDefaults.(bool) {
        if defaultValue, ok := policy["defaultValue"]; ok {
            return new == "" && old == fmt.Sprintf("%v", defaultValue)
        }
    }

    return false
}

Tested using spark_conf but I think it generalizes for all cluster attributes.
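For illustration, here's a sketch of how this might be wired up in the cluster schema (hypothetical, not the provider's actual schema; applied_policy is the snapshot attribute described in the comment above):

// Hypothetical wiring: DiffSuppressFunc is invoked per map key
// (e.g. "spark_conf.spark.some.property"), which matches how policy definitions are keyed.
var clusterSchemaWithSuppression = map[string]*schema.Schema{
    "applied_policy": {
        Type:     schema.TypeString,
        Computed: true, // merged policy definition snapshotted at create time
    },
    "spark_conf": {
        Type:             schema.TypeMap,
        Optional:         true,
        Elem:             &schema.Schema{Type: schema.TypeString},
        DiffSuppressFunc: appliedPolicyDiffSuppressFunc,
    },
}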

Thoughts?

alexott commented 2 months ago

That's solving only the diff problem. How would you signal to TF that the policy has been updated on the other side? Right now you can build the cluster configuration driven by the cluster policy definition using TF code itself, no provider modification required.

bgrams commented 2 months ago

Yes, that is precisely the problem that I am hoping to solve. The diff presented in the plan under these conditions is otherwise meaningless and incorrect as it has no effect on the backend result, so why do we need it?

The applied_policy in the above solution can be recomputed upon cluster edit or when the policy_id changes. There are some edge cases I can think of where this would not be a perfect solution (e.g. a sequence of several external policy updates and cluster restarts over time) but I'm not sure that capturing these is possible without additional API support. In those cases, plan behavior would match what exists today, so suppressing diff of "known" policy-defined attributes would still be a net improvement.
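As a sketch (hypothetical, assuming the applied_policy attribute above), a CustomizeDiff could force the snapshot to be recomputed whenever the policy reference changes:

import (
    "context"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Hypothetical sketch: when policy_id changes, mark applied_policy as unknown so it
// is re-snapshotted from the merged policy definition on the next apply.
func recomputeAppliedPolicy(ctx context.Context, d *schema.ResourceDiff, meta interface{}) error {
    if d.HasChange("policy_id") {
        return d.SetNewComputed("applied_policy")
    }
    return nil
}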

I think the only alternative would be to mark all fields as computed and build the full request in the provider using policy values, which I imagine is similar to the client logic in the workspace UI, but to your point that may not be a feasible approach (though maybe this is worth revisiting).

bgrams commented 2 months ago

Hey @alexott I'd love if this capability would land in a future release - would you like me to open a PR or is there anything else we should consider before then?

alexott commented 2 months ago

The engineering team should discuss this. I think the proposed solution won't help in cases where the policy is updated outside of the cluster definition, and this will lead to more issues - we already have quite a lot of edge cases where computed is used (and we need it to specify that a field is generated outside of Terraform).

In the meantime you can use the cluster policy data source to populate cluster attributes from the policy, like this: https://github.com/databricks/terraform-provider-databricks/issues/1787#issuecomment-1371886145

Cc: @mgyucht