databricks / cli

Databricks Bundle fails to deploy when using instance pool ids #609

Closed SLe-Corre closed 1 year ago

SLe-Corre commented 1 year ago

We've got a bundle using a pre-defined instance pool:

resources:
  jobs:
# Job config
    lakehouse-test:
      name: "lakehouse testexecutor job"
      email_notifications: {}
      notification_settings:
        alert_on_last_attempt: false
        no_alert_for_canceled_runs: false
        no_alert_for_skipped_runs: false
  # Task settings
      tasks:
        - task_key: set_schema_config
          job_cluster_key: "lakehouse-test-cluster"
          notebook_task:
            notebook_path: ./schemas_and_catalogs.sql
            source: WORKSPACE

  # Cluster config
      job_clusters:
        - job_cluster_key: lakehouse-test-cluster
          new_cluster:
            instance_pool_id: 0725-142644-serum127-pool-6lroh0ct
            driver_instance_pool_id: 0725-142644-serum127-pool-6lroh0ct
            policy_id: 0006B36C7E868AD1
            spark_version: 13.2.x-scala2.12
            aws_attributes: {}
            num_workers: 1

We are using a simple Bitbucket pipeline to deploy and execute this on Databricks as a service principal. We get the following error message:

+ databricks bundle deploy
Starting upload of bundle files
Uploaded bundle files at /Users/$DATABRICKS_CLIENT_ID/.bundle/lakehouse-test/default/files!
Starting resource deployment
Error: terraform apply: exit status 1
Error: cannot update job: The field 'node_type_id' cannot be supplied when an instance pool ID is provided.
  with databricks_job.lakehouse-test,
  on bundle.tf.json line 63, in resource.databricks_job.lakehouse-test:
  63:       }

We do not specify node_type_id anywhere in our config. This looks like a defaults error in how the Terraform for the job deployment is generated, which applies a node_type_id even in cases where instance_pool_id is specified, but I don't know enough about Go to attempt to fix this myself.
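
For context, the pipeline step is essentially a script that authenticates as the service principal through environment variables and runs the CLI. A minimal sketch of that shape (the image tag and the way the credentials are supplied are illustrative, not our exact pipeline):

image: python:3.10

pipelines:
  default:
    - step:
        name: Deploy Databricks bundle
        script:
          # DATABRICKS_HOST, DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET are
          # supplied as secured repository variables, so the CLI authenticates as
          # the service principal (OAuth machine-to-machine).
          - databricks --version   # CLI assumed preinstalled on the image or installed in an earlier step
          - databricks bundle deploy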

pietern commented 1 year ago

I suspect what's happening is that the cluster policy referred to by policy_id sets the node_type_id, which then conflicts with the instance pool configuration. Could you check the contents of the cluster policy to confirm?

SLe-Corre commented 1 year ago

I thought that might be the case (and double-checked), but it's not explicitly stated.

The entire contents of the policy are:

{
    "instance_pool_id": {
      "type": "fixed",
      "value": "0725-142644-serum127-pool-6lroh0ct"
    },
    "num_workers": {
      "type": "range",
      "maxValue": 5,
      "hidden": false
    },
    "spark_version": {
      "type": "fixed",
      "value": "auto:latest"
    },
    "data_security_mode": {
      "type": "fixed",
      "value": "SINGLE_USER",
      "hidden": true
    },
    "cluster_type": {
      "type": "fixed",
      "value": "job"
    },
    "spark_conf.spark.databricks.cluster.profile": {
      "type": "forbidden",
      "hidden": true
    }
  }

I'll try dropping elements that it could be related to (spark_version, perhaps? That seems counterintuitive when pools have a preloaded runtime option).

pietern commented 1 year ago

In that case, I suspect the backend is filling in the node_type_id field in the job spec automatically somehow. Perhaps it pulls it from the instance pool on create, and then on update it passes it back again and causes the conflict.
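
If that's the case, the generated bundle.tf.json would hypothetically end up with both fields set on the job cluster, which is exactly what the API rejects. Purely as an illustration (not the actual generated file), the conflicting fragment would look something like:

{
  "resource": {
    "databricks_job": {
      "lakehouse-test": {
        "job_cluster": [
          {
            "job_cluster_key": "lakehouse-test-cluster",
            "new_cluster": {
              "instance_pool_id": "0725-142644-serum127-pool-6lroh0ct",
              "driver_instance_pool_id": "0725-142644-serum127-pool-6lroh0ct",
              "node_type_id": "<filled in by the backend on a previous create>",
              "policy_id": "0006B36C7E868AD1",
              "spark_version": "13.2.x-scala2.12",
              "num_workers": 1
            }
          }
        ]
      }
    }
  }
}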

I'll check in with the team to see if this is expected.

pietern commented 1 year ago

@SLe-Corre On second look, the policy defines an instance pool as well, and it's different from the one in the job cluster.

edit: Never mind, only the suffix differs and it looks manually modified.

SLe-Corre commented 1 year ago

Yeah, sorry, that was me. I didn't know if that information was identifiable to our instances at all. I've just tested the policy by creating a job manually as a user picking that policy, and it runs fine. I'll pull the JSON for it and triple-check, but I'd guess it's a default being set.

There was a known issue with the Terraform module a while ago (I believe the issue was closed, though!) where node_type_id was a required field in the resource and would be populated even if people supplied an instance_pool_id.

pietern commented 1 year ago

I tried reproducing with the same bundle configuration, cluster policy, and instance pool, and it all worked.

When I change any job attributes and redeploy, it works as expected.

Did the job have the same cluster definition when you originally created it, or did you add fields progressively and update?

SLe-Corre commented 1 year ago

Interesting. We haven't created the job; we were expecting the bundle deployment to create the job for us as a service principal, but it fails at the stage of deploying resources.

I'll push a destroy down through the pipeline and try again in case we somehow corrupted it early on.

SLe-Corre commented 1 year ago

So initially we grabbed a JSON config from a job to structure out our job deploy YAML. We've now noticed that, since it's deploying via a Bitbucket pipeline, it's referencing a local state on the build agent, not the state file kept in the Databricks workspace. We're looking to resolve this at the moment, and will update and close this if that's the issue.

SLe-Corre commented 1 year ago

@pietern We've resolved this by manually removing all deployed files from the workspace and doing a fresh deployment.

We could not run destroy. Reading through the console output during pipeline deployment, we noticed why destroy fails: the state file is stored in the Databricks workspace, in the .bundle directory.

Sync appears to do nothing. Our issue was caused by the bundle CLI not resolving differences with the state file that it places in the workspace when it's executed from a pipeline agent.
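
For example, one way to inspect what the CLI has stored there is to list the workspace path from the upload log above (the exact layout under .bundle may differ between CLI versions):

# List the bundle's deployment directory in the workspace
# (path taken from the "Uploaded bundle files at ..." log line)
databricks workspace list /Users/$DATABRICKS_CLIENT_ID/.bundle/lakehouse-test/default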

So the initial issue title is a red herring: instance pools work fine. The issue appears to be an inability to detect remote state existing on the workspace, with parts of the bundle CLI executing locally regardless of any previous deploy command. Local execution on a temporary machine (build agents, for example) is doomed to fail. This issue could(?) also occur if two people try to collaborate on a bundle that has been deployed to a workspace with .databricks in .gitignore: they will each be resolving only their own local state.

Maybe with a "local" environment the workspace state is kept somewhere, and that state isn't being regenerated when the CLI is executed from a temporary environment.

Happy to close this. Any advice on proceeding here? New issue?