databricks / cli


How to add the init_scripts to databricks.yaml file? #1660

Closed · e-gunduz closed this 1 month ago

e-gunduz commented 1 month ago

Describe the issue

I would like to add my init_script to my job clusters, but I could not find any documentation or example code. Is this supported?

Configuration

This is part of my job cluster configuration in the .yaml file:

      job_clusters:
        - job_cluster_key: my_job_cluster_multi_node
          label: default
          new_cluster:
            cluster_name: ""
            spark_version: 14.2.x-cpu-ml-scala2.12
            azure_attributes:
              first_on_demand: 1
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: 100
            node_type_id: Standard_DS4_v2
            enable_elastic_disk: true
            data_security_mode: SINGLE_USER
            autoscale:
              min_workers: 2
              max_workers: 8
            policy_id: '123456789ABCD'
            runtime_engine: STANDARD

I have tried adding this, similar to the cluster JSON configs:

      job_clusters:
        - job_cluster_key: my_job_cluster_multi_node
          label: default
          new_cluster:
            cluster_name: ""
            spark_version: 14.2.x-cpu-ml-scala2.12
            azure_attributes:
              first_on_demand: 1
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: 100
            node_type_id: Standard_DS4_v2
            enable_elastic_disk: true
            data_security_mode: SINGLE_USER
            autoscale:
              min_workers: 2
              max_workers: 8
            policy_id: '123456789ABCD'
            runtime_engine: STANDARD
            init_scripts:
              - volumes:
                destination: "path to init_script.sh"

Steps to reproduce the behavior

How I deploy the code:

  1. Run databricks bundle deploy -t test

I get

  on bundle.tf.json line 1102, in resource.databricks_job.my_pipeline.job_cluster[0].new_cluster.init_scripts[0].volumes:
1102:                   "volumes": {}

The argument "destination" is required, but no definition was found.

If I replace the init_scripts config with this

            init_scripts:
              - volumes:
                - destination: "path to init_script.sh"

It works, but the init_script is not defined in the cluster config in the UI.

OS and CLI version

Databricks CLI version: v0.219.0
OS: macOS

Is this a regression?

This never worked.

Debug Logs

Repeatedly:

  on bundle.tf.json line 1102, in resource.databricks_job.my_pipeline.job_cluster[0].new_cluster.init_scripts[0].volumes:
1102:                   "volumes": {}

The argument "destination" is required, but no definition was found.
andrewnester commented 1 month ago

It seems like the correct configuration should be:

init_scripts:
  volumes:
    - destination: "path to init_script.sh"
pietern commented 1 month ago

On v0.219.0 you should also see warnings about incorrect configuration when running databricks bundle validate.
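For example, with the same target used for the deploy:

databricks bundle validate -t test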

The correct configuration (per the API docs) should be:

init_scripts: # Array with objects
  - volumes: # Single key object
      destination: /path
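Applied to the job cluster from the issue, that would look like this (a sketch; the other new_cluster fields are unchanged and the destination path is a placeholder):

job_clusters:
  - job_cluster_key: my_job_cluster_multi_node
    new_cluster:
      spark_version: 14.2.x-cpu-ml-scala2.12
      node_type_id: Standard_DS4_v2
      # ... remaining new_cluster fields ...
      init_scripts:
        - volumes:
            destination: "path to init_script.sh"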
e-gunduz commented 1 month ago

OK, the problem was the indentation. It was parsed as

init_scripts=[{'volumes': {}, 'destination': '/path'}]

and now it is

init_scripts=[{'volumes': {'destination': '/path'}}]
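For anyone else hitting this, here is a minimal sketch with PyYAML (illustration only, not part of the bundle config) showing how the two indentations parse:

import yaml  # PyYAML

# destination at the same indentation level as volumes:
# it becomes a sibling key and volumes stays empty.
broken = """
init_scripts:
  - volumes:
    destination: "path to init_script.sh"
"""
print(yaml.safe_load(broken))
# {'init_scripts': [{'volumes': None, 'destination': 'path to init_script.sh'}]}

# destination indented one level deeper than volumes:
# it is nested under volumes, as the API expects.
fixed = """
init_scripts:
  - volumes:
      destination: "path to init_script.sh"
"""
print(yaml.safe_load(fixed))
# {'init_scripts': [{'volumes': {'destination': 'path to init_script.sh'}}]}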

Thank you for that :)