
Variables not updated by `bundle run` #1526

Closed jbpdl22 closed 5 days ago

jbpdl22 commented 5 days ago

Describe the issue

When running a DAB-deployed workflow with `bundle run --var="myarg=bar"`, the variables passed are not used in the workflow run. This is despite them being required to run the command (the CLI complains when they are not given).

Configuration

variables:
  myarg:
    description: Some arg to print

resources:
  jobs:
    print_my_arg:
      name: print_my_arg

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          spark_python_task:
            python_file: "../src/print_arg.py"
            parameters:
              - "--myarg"
              - ${var.myarg}

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            autoscale:
              min_workers: 1
              max_workers: 1
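
The referenced script is not included in the report; a minimal sketch of what ../src/print_arg.py might look like, assuming it does nothing more than parse and print the flag:

# Hypothetical contents of ../src/print_arg.py (assumed for illustration; not part of the original report).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--myarg", required=True, help="Value forwarded from the task parameters")
args = parser.parse_args()

# If only deploy-time interpolation happens, this prints the value passed to
# `bundle deploy`, not the one passed to `bundle run`.
print(f"myarg = {args.myarg}")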

Steps to reproduce the behavior

  1. Run databricks bundle deploy -t dev --profile dev --var="myarg=foo"
  2. Run databricks bundle run -t dev print_my_arg --profile dev --var="myarg=bar"
  3. Observe that the value passed at deploy time ("foo") is used during the run, not the value passed when triggering the run ("bar")

Expected Behavior

A variable passed at run time is substituted for that run, as per the documentation.

Actual Behavior

Variables passed at deploy time are used for all runs, regardless of what is passed with bundle run.

OS and CLI version

Databricks CLI v0.217.1 on macOS 14.5

Is this a regression?

This is the only version I've tried.

Debug Logs

Can supply if needed.

pietern commented 5 days ago

Thanks for reporting.

I understand how this can be confusing, and it is something we can and should improve. The bundle variables you specify are deployment-time variables: they are interpolated into the job definition when you deploy the bundle. Running a job executes the previously deployed job definition; it does not redeploy the bundle first.

You can parameterize a job at run time by using job parameters (docs). For example:

resources:
  jobs:
    job_with_parameters:
      name: job_with_parameters

      tasks:
        - task_key: task_a
          spark_python_task:
            python_file: ../src/file.py
            parameters:
              - "--foo={{ job.parameters.foo }}"
              - "--bar={{ job.parameters.bar }}"
              - "--qux={{ job.parameters.qux }}"

          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 1
            spark_version: 14.3.x-scala2.12

      parameters:
        - name: foo
          default: v1
        - name: bar
          default: v1
        - name: qux
          default: v1
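
The task script itself (../src/file.py, not shown in this thread) then only needs to read the forwarded flags; a minimal sketch, assuming it does nothing but parse and echo them:

# Hypothetical contents of ../src/file.py (assumed for illustration; not part of the original comment).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--foo")
parser.add_argument("--bar")
parser.add_argument("--qux")
args = parser.parse_args()

# Each value resolves from {{ job.parameters.* }}: the default from the job
# definition, or the override supplied when the run is triggered.
print(f"foo={args.foo} bar={args.bar} qux={args.qux}")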

These parameters can be specified at run-time. For example:

$ databricks bundle run --var="myarg=bar" job_with_parameters -- --foo=v2 --bar=v2

Note that you still need to specify the bundle variables because they are part of the bundle configuration (e.g. they could set the name of the parameter itself).

I hope this clarifies things.

Closing because this is working as intended.

jbpdl22 commented 4 days ago

Thanks for the quick response. This makes more sense. A few suggestions:

  1. I think an example using job parameters in the bundles section of the docs would be really helpful.
  2. It's confusing that bundle run requires variables that are deployment-time only (if I've understood your explanation correctly). Perhaps those shouldn't be required for run, since passing them at that point has no effect anyway?