databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Issue with `databricks_global_init_script` resource #1903

Closed · robinfrankhuizen closed this issue 1 year ago

robinfrankhuizen commented 1 year ago

We ran into the following issue when using Terraform to deploy Databricks on Azure.

Configuration

Our configuration is subdivided into modules but the relevant parts are below.

# Azure part
resource "azurerm_databricks_workspace" "databricks" {
  name                                  = "dbw-my-databricks-123"
  resource_group_name                   = "rg-my-rg-123"
  managed_resource_group_name           = "rg-my-rg-123-dbw"
  location                              = "westeurope"
  sku                                   = "premium"
  infrastructure_encryption_enabled     = true
  customer_managed_key_enabled          = false
  public_network_access_enabled         = false
  network_security_group_rules_required = "AllRules"

  custom_parameters {
    virtual_network_id                                    = var.network.vnet_id
    no_public_ip                                          = true
    public_subnet_name                                    = var.public_subnet_name
    private_subnet_name                                   = var.private_subnet_name
    public_subnet_network_security_group_association_id  = var.public_nsg_id
    private_subnet_network_security_group_association_id = var.private_nsg_id
    machine_learning_workspace_id                         = null
  }
}

# Databricks part
resource "databricks_global_init_script" "dbw_global_init_script" {
  source  = "${path.module}/init.sh"
  name    = local.initscript_name
  enabled = var.global_init_script_enabled
}

resource "databricks_workspace_conf" "dbw_general_settings" {
  custom_config = local.custom_config
}

resource "databricks_ip_access_list" "allowed-list" {
  count        = length(var.ip_allow_list) > 0 ? 1 : 0
  label        = "allow-in"
  list_type    = "ALLOW"
  ip_addresses = var.ip_allow_list
}

resource "databricks_cluster" "SparkCluster" {
  for_each                = local.clusters
  cluster_name            = each.key
  spark_version           = try(each.value["SparkVersion"], local.cluster_default.default.spark_version)
  node_type_id            = try(each.value["DatabricksNodeType"], local.cluster_default.default.databricksnodetype)
  driver_node_type_id     = try(each.value["DatabricksDriverType"], try(each.value["DatabricksNodeType"], local.cluster_default.default.databricksnodetype))
  autotermination_minutes = try(each.value["AutoTermination"], local.cluster_default.default.auto_termination)
  is_pinned               = local.is_pinned
  spark_conf              = merge(try(each.value["SparkSettings"], {}), local.spark_config)
  custom_tags             = local.custom_tags
  azure_attributes {
    availability = try(each.value["Availability"], local.cluster_default.default.availability)
  }
  dynamic "autoscale" {
    for_each = try(each.value["AutoScale"], local.cluster_default.default.auto_scale) ? toset(["once"]) : toset([])
    content {
      min_workers = try(each.value["MinWorkers"], local.cluster_default.default.min_workers)
      max_workers = try(each.value["MaxWorkers"], local.cluster_default.default.max_workers)
    }
  }
  num_workers    = !try(each.value["AutoScale"], local.cluster_default.default.auto_scale) ? try(each.value["NumWorkers"], local.cluster_default.default.num_workers) : null
  spark_env_vars = var.spark_env_vars

}

locals {
  is_pinned       = true
  initscript_name = "pip-init-script"

  custom_config = {
    "enableExportNotebook" : "true"
    "enableDbfsFileBrowser" : "true"
    "enableNotebookTableClipboard" : "true"
    "enableWorkspaceFilesystem" : "true"
    "enableProjectsAllowList" : "false"
    "enableTokensConfig" : "false"
    "enableIpAccessLists" : "true"
  }
  spark_config = {
    "spark.databricks.cluster.profile"                 = "serverless"
    "spark.databricks.passthrough.enabled"             = true
    "spark.databricks.delta.preview.enabled"           = true
    "spark.databricks.pyspark.enableProcessIsolation"  = true
    "spark.databricks.repl.allowedLanguages"           = "python,sql"
  }
  custom_tags = {
    "ResourceClass" = "Serverless"
  }
  clusters = var.clusters
  cluster_default = {
    default = {
      databricksnodetype = "Standard_DS3_v2"
      spark_version      = data.databricks_spark_version.latest_LT.id
      availability       = "SPOT_AZURE"
      min_workers        = 1
      max_workers        = 2
      num_workers        = 1
      auto_scale         = false
      auto_termination   = 30
    }
  }
}

Expected Behavior

A Databricks workspace is deployed and a global init script is added.

Actual Behavior

An error occurred:

╷
│ Error: cannot create global init script: com.microsoft.azure.storage.StorageException: The specified resource does not exist.
│ 
│   with module.databricks.module.databricks_config.databricks_global_init_script.dbw_global_init_script,
│   on ../modules/databricks/submodules/config/main.tf line 1, in resource "databricks_global_init_script" "dbw_global_init_script":
│    1: resource "databricks_global_init_script" "dbw_global_init_script" {

Steps to Reproduce

  1. terragrunt apply

Terraform and provider versions

hashicorp/azurerm v3.34.0, databricks/databricks v1.3.0

Debug Output

2023-01-04T13:39:15.5915734Z 2023-01-04T13:39:15.591Z [DEBUG] provider.terraform-provider-databricks_v1.3.0: 503 Service Unavailable {
2023-01-04T13:39:15.5916419Z   "error_code": "TEMPORARILY_UNAVAILABLE",
2023-01-04T13:39:15.5916928Z   "message": "com.microsoft.azure.storage.StorageException: The specified resource does not exist."
2023-01-04T13:39:15.5917508Z }: timestamp=2023-01-04T13:39:15.590Z
2023-01-04T13:39:15.5918574Z 2023-01-04T13:39:15.591Z [WARN]  provider.terraform-provider-databricks_v1.3.0: /api/2.0/global-init-scripts:503 - com.microsoft.azure.storage.StorageException: The specified resource does not exist.: timestamp=2023-01-04T13:39:15.591Z
2023-01-04T13:39:15.5920044Z 2023-01-04T13:39:15.591Z [WARN]  provider.terraform-provider-databricks_v1.3.0: /api/2.0/global-init-scripts:503 - com.microsoft.azure.storage.StorageException: The specified resource does not exist.: timestamp=2023-01-04T13:39:15.591Z
2023-01-04T13:39:15.5923305Z 2023-01-04T13:39:15.591Z [ERROR] provider.terraform-provider-databricks_v1.3.0: Response contains error diagnostic: diagnostic_detail= tf_proto_version=5.3 tf_provider_addr=provider tf_req_id=764929f3-3d1e-1cb2-b859-ba873794b0dc tf_resource_type=databricks_global_init_script @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 tf_rpc=ApplyResourceChange diagnostic_severity=ERROR diagnostic_summary="cannot create global init script: com.microsoft.azure.storage.StorageException: The specified resource does not exist." @module=sdk.proto timestamp=2023-01-04T13:39:15.591Z
2023-01-04T13:39:15.5925766Z 2023-01-04T13:39:15.592Z [ERROR] vertex "module.databricks.module.databricks_config.databricks_global_init_script.dbw_global_init_script" error: cannot create global init script: com.microsoft.azure.storage.StorageException: The specified resource does not exist.

Important Factoids

  1. We cannot reproduce the issue consistently. Sometimes we run into it with every apply; sometimes we can apply our configuration without any issues.
  2. We have checked whether the storage account that Databricks provisions actually exists before the global init script is posted, and it does.
  3. Further to the second point, we tried adding a 2-minute sleep between provisioning the workspace and applying the rest of the configuration, but ran into the same issue.
  4. We use terragrunt to apply our configuration.
alexott commented 1 year ago

I would say that this is a bug in the platform, or something like that. Maybe the azurerm provider reports success too early. But it's better to escalate it there, as it depends less on the Databricks provider and more on the readiness of the Azure workspace.

abij commented 1 year ago

You might be right that this is more of an Azure-related issue than a problem with the Databricks provider calling the API.

I have the feeling that the storage account is ready, but the containers in it are not. If the global init script is created directly after the workspace becomes available, the storage exception appears (sometimes). The global init script is saved in one of the containers on the storage account. Other things, like workspace settings or groups/users, are always created without any issue.

nfx commented 1 year ago

Following up - is this issue still relevant?

abij commented 1 year ago

We have solved it by adding a time_sleep resource that depends_on the workspace, with a 30s create timeout. The global init script then depends_on this time_sleep, so it waits 30 seconds after the workspace is created (sketched below).
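
A minimal sketch of that workaround, assuming the hashicorp/time provider and the resource names from the configuration above (module boundaries omitted); the 30s duration is the value mentioned here and may need tuning:

resource "time_sleep" "wait_for_workspace" {
  # Start counting only once the workspace itself has been created.
  depends_on      = [azurerm_databricks_workspace.databricks]
  create_duration = "30s"
}

resource "databricks_global_init_script" "dbw_global_init_script" {
  source  = "${path.module}/init.sh"
  name    = local.initscript_name
  enabled = var.global_init_script_enabled

  # Give the workspace storage containers time to become available
  # before the script is posted.
  depends_on = [time_sleep.wait_for_workspace]
}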

alexott commented 1 year ago

Closing, as this was solved by adding a delay after workspace creation.

matt-carr commented 1 year ago

For future searchers: we also encountered this issue, with the following error message in the logs:

HTTP/2.0 503 Service Unavailable
 {
   "error_code": "TEMPORARILY_UNAVAILABLE",
   "message": "Missing credentials to access Azure container"
 }

It seems like a similar issue; adding a wait, as described above, similarly seems to have solved it.