databricks / cli

DAB deployment fails with `Error: cannot create job: NumWorkers could be 0 only for SingleNode clusters` #1546

Open · m-o-leary opened this issue 2 days ago

m-o-leary commented 2 days ago

Describe the issue

When running `databricks bundle deploy`, I get the following error (potentially sensitive info replaced with `<internal ...>`):

Updating deployment state...
Error: terraform apply: exit status 1

Error: cannot create job: NumWorkers could be 0 only for SingleNode clusters. See https://docs.databricks.com/clusters/single-node.html for more details

  with <internal key>,
  on bundle.tf.json line 87, in resource.databricks_job.<internal key>:
  87:       },
...

The cluster definition at that point is:

{
  "job_cluster_key": "<internal key>",
  "new_cluster": {
    "aws_attributes": {
      "first_on_demand": 1,
      "instance_profile_arn": "<internal arn>"
    },
    "custom_tags": {
      "env": "dev",
      "owner": "datascience",
      "role": "databricks",
      "vertical": "datascience"
    },
    "data_security_mode": "SINGLE_USER",
    "node_type_id": "m6i.2xlarge",
    "num_workers": 0,
    "policy_id": "<internal policy id>",
    "spark_conf": {
      "spark.databricks.cluster.profile": "singleNode",
      "spark.databricks.delta.schema.autoMerge.enabled": "true",
      "spark.databricks.sql.initial.catalog.name": "<internal catalog name>",
      "spark.master": "local[*, 4]"
    },
    "spark_version": "13.2.x-cpu-ml-scala2.12"
  }
}
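
For context, a job cluster like this would typically be declared under `job_clusters` in the bundle's `databricks.yml`. A minimal sketch of what that might look like (the job and cluster key names here are placeholders, and the redacted fields are omitted):

resources:
  jobs:
    my_job:                                       # placeholder job name
      name: my_job
      job_clusters:
        - job_cluster_key: my_single_node_cluster # placeholder key
          new_cluster:
            spark_version: 13.2.x-cpu-ml-scala2.12
            node_type_id: m6i.2xlarge
            num_workers: 0
            data_security_mode: SINGLE_USER
            spark_conf:
              "spark.databricks.cluster.profile": "singleNode"
              "spark.master": "local[*, 4]"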

Steps to reproduce the behavior

  1. Install the Databricks CLI v0.222.0
  2. Run `databricks bundle deploy`
  3. See the error
  4. Downgrade to v0.221.1
  5. Run `databricks bundle deploy`
  6. See no error

Expected Behavior

The bundle should have deployed successfully.

Actual Behavior

The bundle failed to deploy at the `Updating deployment state...` step.

OS and CLI version

Databricks CLI: v0.222.0
OS:

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Is this a regression?

Yes. Both v0.221.1 and v0.217.0 work.

Maybe related to https://github.com/databricks/cli/issues/592

pietern commented 2 days ago

Thanks for reporting the issue. I'm investigating it.

pietern commented 2 days ago

The following cluster definition works:

new_cluster:
  node_type_id: i3.xlarge
  num_workers: 0
  spark_version: 14.3.x-scala2.12
  spark_conf:
    "spark.databricks.cluster.profile": "singleNode"
    "spark.master": "local[*, 4]"
  custom_tags:
    "ResourceClass": "SingleNode"

Note the presence of "ResourceClass": "SingleNode".

This may get you unblocked while we figure out the underlying cause of this issue.

pietern commented 2 days ago

A change in the Terraform provider (PR, released as part of v1.48.0) caused additional validation to run for job clusters. This includes a check for the `ResourceClass` field under `custom_tags`, which is why this error shows up if it isn't specified.

You can mitigate by including the following stanza in your job cluster definition:

custom_tags:
    "ResourceClass": "SingleNode"

Meanwhile, we're figuring out whether this is something we should include transparently or not.
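
For illustration, applied to the `new_cluster` definition from the original report, the relevant fields would end up roughly like this (a sketch only; redacted fields and other tags omitted):

new_cluster:
  spark_version: 13.2.x-cpu-ml-scala2.12
  node_type_id: m6i.2xlarge
  num_workers: 0
  spark_conf:
    "spark.databricks.cluster.profile": "singleNode"
    "spark.master": "local[*, 4]"
  custom_tags:
    # existing tags (env, owner, etc.) stay as they are; ResourceClass is the addition
    "ResourceClass": "SingleNode"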

georgealexanderday commented 2 days ago

The inclusion of the custom tag has helped resolve our issue - appreciate the support here.

drelias15 commented 2 days ago

I am seeing the same issue after running my CI pipeline. My job configuration has the lines below, but I'm still seeing the error. Can you please help?

spark_conf:
                "spark.databricks.cluster.profile": "singleNode"
                "spark.master": "local[*, 4]"
            custom_tags:
                "ResourceClass": "SingleNode"
pietern commented 1 day ago

@drelias15 Are you sure you have the indentation right?

custom_tags needs to be at the same level as spark_conf.
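
For illustration, the two blocks should sit at the same depth under `new_cluster`, roughly like this (a minimal sketch; the instance type and Spark version are placeholders):

new_cluster:
  node_type_id: i3.xlarge          # placeholder
  num_workers: 0
  spark_version: 14.3.x-scala2.12  # placeholder
  spark_conf:
    "spark.databricks.cluster.profile": "singleNode"
    "spark.master": "local[*, 4]"
  custom_tags:
    "ResourceClass": "SingleNode"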

drelias15 commented 1 day ago

I believe the indentation is good. This snapshot is from the YAML of the deployed job. One point: spark.databricks.cluster.profile is defined in the cluster policy, so it is not defined in the workflows. Do you think it needs to be present in the workflow definition?

[screenshot: YAML of the deployed job configuration]