
Feature request: configure all-purpose cluster libraries through DAB #1860

Open rsayn opened 3 weeks ago

rsayn commented 3 weeks ago

Describe the issue

Since 0.229.0, all-purpose (interactive) clusters can be created via DAB.

With job clusters, it's straightforward to install a DAB wheel artifact by specifying the libraries for a task executed on that cluster.

With all-purpose clusters this is currently not possible; the only workaround is to perform post-deployment operations with the SDK or the REST APIs to attach a library programmatically.

Configuration

bundle:
  name: demo-dab
  databricks_cli_version: 0.231.0

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  clusters:
    interactive:
      cluster_name: ${bundle.name} cluster
      data_security_mode: SINGLE_USER
      # [...] cluster config pointing to an all-purpose policy ID
      # these next lines are currently not valid
      libraries:
        - whl: "../dist/*.whl"

Expected Behavior

There should be a way to specify the deployed bundle wheel as a dependency.

Actual Behavior

There's currently no way to specify this. The wheel has to be attached to the cluster after deployment via the SDK by:

  1. Retrieving the cluster's ID
  2. Attaching libraries

Note that both steps would greatly benefit from the variable substitution that happens inside DABs; without it, the cluster name and library path have to be inferred somehow.
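
For illustration, a minimal sketch of that workaround with the Databricks Python SDK; the cluster name and workspace wheel path below are assumptions and have to be kept in sync with the bundle by hand:

# Post-deployment workaround (outside DABs): attach the deployed wheel
# to the all-purpose cluster via the Databricks Python SDK.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library

w = WorkspaceClient()

# 1. Retrieve the cluster ID by matching on the cluster name, since the
#    ${resources.clusters.interactive.cluster_id} substitution is not
#    available outside the bundle ("demo-dab cluster" is assumed here).
cluster_id = next(
    c.cluster_id for c in w.clusters.list() if c.cluster_name == "demo-dab cluster"
)

# 2. Attach the wheel; the deployed artifact path also has to be inferred
#    (the path below is a placeholder, not the real bundle layout).
w.libraries.install(
    cluster_id=cluster_id,
    libraries=[Library(whl="/Workspace/path/to/deployed/demo_dab-0.1.0-py3-none-any.whl")],
)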

OS and CLI version

Is this a regression?

No, this is a new feature request

Debug Logs

N/A

andrewnester commented 2 weeks ago

Hi @rsayn ! Thanks for reporting the issue. Just to confirm: when you run a workflow on this cluster, the library is not installed either?

rsayn commented 2 weeks ago

Hey @andrewnester! If I define jobs to run on this cluster, I can include libraries in the job / task definition. However, my use case here is to boot a small interactive cluster for development / debugging via attached notebooks, and I'd like to avoid the overhead of manually installing the project wheel that I deploy through DABs.

My request comes from the fact that you can specify cluster-scoped libraries from the Databricks UI, the SDK or via a cluster policy, but not via DABs.
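
For comparison, a rough sketch of the cluster-policy route with the Python SDK, assuming a workspace and SDK version that support libraries on cluster policies; the policy definition and wheel path are made up:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library

w = WorkspaceClient()

# Any cluster created from this policy gets the wheel installed automatically,
# which is the behaviour I'd like to express directly in the bundle instead.
w.cluster_policies.create(
    name="demo-dab-policy",
    definition='{"spark_version": {"type": "fixed", "value": "15.4.x-scala2.12"}}',
    libraries=[Library(whl="/Volumes/main/default/wheels/demo_dab-0.1.0-py3-none-any.whl")],
)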

andrewnester commented 2 weeks ago

@rsayn thanks for clarifying, that makes sense. My expectation was that, with a configuration like yours, the libraries would be installed when the cluster is started (i.e. when the corresponding job starts). If that's not the case, it has to be fixed on our side and I'll look into it.

rsayn commented 2 weeks ago

All right, thanks a lot! To further clarify: I think (please confirm) all-purpose clusters can still be used for jobs.

In that case, I'd expect any library configured on the job's tasks to override the default cluster libraries (which I think is the current behaviour if you attach libraries to a cluster policy) 🤔

andrewnester commented 2 weeks ago

I think I might have misunderstood the original issue. In any case, even if you use an interactive cluster, you can still use it in job tasks. But for the libraries to be installed, you need to specify them in the libraries section of the tasks, not of the cluster, so it could look like this:

resources:
  clusters:
    test_cluster:
      cluster_name: "test-cluste"
      spark_version: "13.3.x-snapshot-scala2.12"
      num_workers: 1
      data_security_mode: USER_ISOLATION

  jobs:
    some_other_job:
      name: "[${bundle.target}] Test Wheel Job"
      tasks:
        - task_key: TestTask
          existing_cluster_id: "${resources.clusters.test_cluster.cluster_id}"
          python_wheel_task:
            package_name: my_test_code
            entry_point: run
            parameters:
              - "one"
              - "two"
          libraries:
            - whl: ./dist/*.whl

rsayn commented 2 weeks ago

Exactly. In my case I don't have any jobs attached to the cluster, so I can't use the setup you provided.

rsayn commented 1 week ago

Hello @andrewnester, any news about this? 🙏 LMK if I can help in any way!