databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Issue with `databricks_job` resource: Job with two tasks keeps flipping the tasks and never reaches a stable state #4011

Open fjakobs opened 1 week ago

fjakobs commented 1 week ago

I have a Terraform config with a job containing two Python file tasks that never reaches a stable state. On each apply, Terraform updates the first task to match the second task defined in the config, and updates the second task to match the first one.

Together with #3951, this leads to a situation where a task can end up with `source = "GIT"` even though this is not expected.

I learned from @mgyucht that this doesn't happen if the task blocks are declared in `task_key` order in the Terraform config.
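Per that observation, here is a sketch of the same two task blocks from the config below, reordered so that `"A"` comes before `"Z"` (note the block order carries no execution semantics; `depends_on` still controls that). This reordering reportedly yields a stable plan, though I have not confirmed it myself:

```hcl
  # Same tasks as in the config below, declared in task_key order ("A" first).
  task {
    depends_on {
      task_key = "Z"
    }
    environment_key = "Default"
    spark_python_task {
      python_file = "/Workspace/Users/mikhail.kulikov@databricks.com/source.py"
    }
    task_key = "A"
  }

  task {
    environment_key = "Default"
    spark_python_task {
      python_file = "test.py"
      source      = "GIT"
    }
    task_key = "Z"
  }
```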

Configuration

terraform {
  required_providers {
    databricks = {
      source  = "Databricks/databricks"
      version = "1.50.0"
    }
  }
}

provider "databricks" {
  host = "https://<HOST>/"
}

resource "databricks_job" "GIT_SOURCE_BUG" {
  format                = "MULTI_TASK"
  name                  = "GIT_SOURCE_BUG"

  environment {
    environment_key = "Default"
    spec {
      client = "1"
    }
  }

  git_source {
    branch   = "main"
    provider = "GitHub"
    url      = "https://gist.github.com/..."
  }

  task {
    environment_key = "Default"
    spark_python_task {
      python_file = "test.py"
      source      = "GIT"
    }
    task_key = "Z"
  }

  task {
    depends_on {
      task_key = "Z"
    }
    environment_key = "Default"
    spark_python_task {
      python_file = "/Workspace/Users/mikhail.kulikov@databricks.com/source.py"
    }
    task_key = "A"
  }
}

Expected Behavior

The second call to terraform apply should detect no changes and be a no-op.

Actual Behavior

terraform apply will always detect changes like this:

# databricks_job.GIT_SOURCE_BUG will be updated in-place
  ~ resource "databricks_job" "GIT_SOURCE_BUG" {
        id                        = "704355916154056"
        name                      = "GIT_SOURCE_BUG"
        # (9 unchanged attributes hidden)

      ~ task {
          ~ task_key                  = "A" -> "Z"
            # (7 unchanged attributes hidden)

          - depends_on {
              - task_key = "Z" -> null
            }

          ~ spark_python_task {
              ~ python_file = "/Workspace/Users/mikhail.kulikov@databricks.com/source.py" -> "test.py"
                # (2 unchanged attributes hidden)
            }

            # (1 unchanged block hidden)
        }
      ~ task {
          ~ task_key                  = "Z" -> "A"
            # (7 unchanged attributes hidden)

          + depends_on {
              + task_key = "Z"
            }

          ~ spark_python_task {
              ~ python_file = "test.py" -> "/Workspace/Users/mikhail.kulikov@databricks.com/source.py"
                # (2 unchanged attributes hidden)
            }

            # (1 unchanged block hidden)
        }

        # (5 unchanged blocks hidden)
    }

Terraform and provider versions

Databricks provider: 1.50.0
Terraform: v1.5.7

ivandad1 commented 1 week ago

Hi, I found a workaround: declare the tasks as a list variable, sort it by task_key, and generate the task blocks with a dynamic block.

resource "databricks_job" "job" {
  ...
  dynamic "task" {
      for_each = local.task_sorted_list
      content {
        ...
      }
    }
}

variable "tasks" {
  type = list(object({
    task_key                  = string
    notebook_path             = string
    base_parameters           = optional(map(string), {})
    depends_on                = optional(list(string), [])
    run_if                    = optional(string, "ALL_SUCCESS")
    max_retries               = optional(number, 0)
    min_retry_interval_millis = optional(number, 0)
    retry_on_timeout          = optional(bool, false)
    timeout_seconds           = optional(number, 0)
  }))
}

locals {
  # Sort the task keys alphabetically, then rebuild the task list in that order.
  task_sorted_keys = distinct(sort(var.tasks[*].task_key))
  task_sorted_list = flatten([for key in local.task_sorted_keys : [for t in var.tasks : t if t.task_key == key]])
}
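For reference, a fuller sketch of how the dynamic block might consume the sorted list. The attribute names follow the variable above; the `notebook_task` mapping and the nested `depends_on` handling are illustrative, not tested:

```hcl
resource "databricks_job" "job" {
  name = "example-sorted-tasks"

  dynamic "task" {
    for_each = local.task_sorted_list
    content {
      task_key        = task.value.task_key
      max_retries     = task.value.max_retries
      timeout_seconds = task.value.timeout_seconds
      run_if          = task.value.run_if

      notebook_task {
        notebook_path   = task.value.notebook_path
        base_parameters = task.value.base_parameters
      }

      # depends_on is a list of task keys, so it needs its own dynamic block.
      dynamic "depends_on" {
        for_each = task.value.depends_on
        content {
          task_key = depends_on.value
        }
      }
    }
  }
}
```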

Hope this helps :)