
Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Issue with `databricks_job` resource: Job with two tasks keeps flipping the tasks and never reaches a stable state #4011

Open fjakobs opened 2 months ago

fjakobs commented 2 months ago

I have a Terraform config for a job with two Python file tasks that never reaches a stable state. On each apply, Terraform updates the first task to match the second task defined in the config, and the second task to match the first.

Together with #3951 this leads to a situation where a task can end up with `source = GIT` even though that is not expected.

I learned from @mgyucht that this doesn't happen if the tasks are declared in sorted order (by `task_key`) in the Terraform config, which suggests the provider compares tasks by position rather than by key (see the reordered sketch after the configuration below).

Configuration

terraform {
  required_providers {
    databricks = {
      source  = "Databricks/databricks"
      version = "1.50.0"
    }
  }
}

provider "databricks" {
  host = "https://<HOST>/"
}

resource "databricks_job" "GIT_SOURCE_BUG" {
  format                = "MULTI_TASK"
  name                  = "GIT_SOURCE_BUG"

  environment {
    environment_key = "Default"
    spec {
      client = "1"
    }
  }

  git_source {
    branch   = "main"
    provider = "GitHub"
    url      = "https://gist.github.com/..."
  }

  task {
    environment_key = "Default"
    spark_python_task {
      python_file = "test.py"
      source      = "GIT"
    }
    task_key = "Z"
  }

  task {
    depends_on {
      task_key = "Z"
    }
    environment_key = "Default"
    spark_python_task {
      python_file = "/Workspace/Users/mikhail.kulikov@databricks.com/source.py"
    }
    task_key = "A"
  }
}
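
For reference, a minimal sketch of the same job with the tasks declared in sorted `task_key` order (`A` before `Z`), which reportedly avoids the perpetual diff; the resource name here is illustrative, and everything except the declaration order is unchanged:

resource "databricks_job" "GIT_SOURCE_BUG_SORTED" {
  # format, name, environment, and git_source blocks as above

  task {
    depends_on {
      task_key = "Z"
    }
    environment_key = "Default"
    spark_python_task {
      python_file = "/Workspace/Users/mikhail.kulikov@databricks.com/source.py"
    }
    task_key = "A"
  }

  task {
    environment_key = "Default"
    spark_python_task {
      python_file = "test.py"
      source      = "GIT"
    }
    task_key = "Z"
  }
}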

Expected Behavior

The second call to terraform apply should detect no changes and be a no-op.

Actual Behavior

terraform apply will always detect changes like this:

# databricks_job.GIT_SOURCE_BUG will be updated in-place
  ~ resource "databricks_job" "GIT_SOURCE_BUG" {
        id                        = "704355916154056"
        name                      = "GIT_SOURCE_BUG"
        # (9 unchanged attributes hidden)

      ~ task {
          ~ task_key                  = "A" -> "Z"
            # (7 unchanged attributes hidden)

          - depends_on {
              - task_key = "Z" -> null
            }

          ~ spark_python_task {
              ~ python_file = "/Workspace/Users/mikhail.kulikov@databricks.com/source.py" -> "test.py"
                # (2 unchanged attributes hidden)
            }

            # (1 unchanged block hidden)
        }
      ~ task {
          ~ task_key                  = "Z" -> "A"
            # (7 unchanged attributes hidden)

          + depends_on {
              + task_key = "Z"
            }

          ~ spark_python_task {
              ~ python_file = "test.py" -> "/Workspace/Users/mikhail.kulikov@databricks.com/source.py"
                # (2 unchanged attributes hidden)
            }

            # (1 unchanged block hidden)
        }

        # (5 unchanged blocks hidden)
    }

Terraform and provider versions

TF provider: 1.50.0
Terraform: v1.5.7

ivandad1 commented 2 months ago

Hi, I found a workaround: declare the tasks as a list variable and generate the `task` blocks with a `dynamic` block, sorted by `task_key`.

resource "databricks_job" "job" {
  ...
  dynamic "task" {
      for_each = local.task_sorted_list
      content {
        ...
      }
    }
}

variable "tasks" {
  type = list(object({
    task_key                  = string
    notebook_path             = string
    base_parameters           = optional(map(string), {})
    depends_on                = optional(list(string), [])
    run_if                    = optional(string, "ALL_SUCCESS")
    max_retries               = optional(number, 0)
    min_retry_interval_millis = optional(number, 0)
    retry_on_timeout          = optional(bool, false)
    timeout_seconds           = optional(number, 0)
  }))
}

locals {
  # Task keys, de-duplicated and in alphabetical order
  task_sorted_keys = distinct(sort(var.tasks[*].task_key))
  # Tasks re-ordered to match the sorted keys
  task_sorted_list = flatten([for value in local.task_sorted_keys : [for elem in var.tasks : elem if value == elem.task_key]])
}
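
For completeness, a hypothetical sketch of what the dynamic block body could look like, assuming notebook tasks and the field names from the variable above (untested):

dynamic "task" {
  # Iterate over the tasks in sorted order so the config matches the API response
  for_each = local.task_sorted_list
  content {
    task_key = task.value.task_key

    notebook_task {
      notebook_path   = task.value.notebook_path
      base_parameters = task.value.base_parameters
    }

    # One depends_on block per upstream task key
    dynamic "depends_on" {
      for_each = task.value.depends_on
      content {
        task_key = depends_on.value
      }
    }

    run_if                    = task.value.run_if
    max_retries               = task.value.max_retries
    min_retry_interval_millis = task.value.min_retry_interval_millis
    retry_on_timeout          = task.value.retry_on_timeout
    timeout_seconds           = task.value.timeout_seconds
  }
}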

Hope this helps :)

mgyucht commented 1 month ago

This can only be properly addressed once we move to the Terraform Plugin Framework, which will give us much more control over the generated plan and the ability to inspect the configuration directly.