databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

Wheel file based job deployment not removing previous wheel version from cluster #834

Closed tarique-msci closed 1 year ago

tarique-msci commented 1 year ago

I have a PySpark job that I am deploying as a wheel package using dbx on Databricks. The deployment file looks something like the following:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "12.2.x-scala2.12" 
    node_type_id: "Standard_D4ds_v5"
    spark_conf:  # remove if not needed
      spark.databricks.delta.preview.enabled: 'true'
      # For postgres connection
      spark.network.timeout: '300000'
    # instance_pool_id: "instance-pool://some-pool-name" # remove if not needed
    # driver_instance_pool_id: "instance-pool://some-pool-name"  # remove if not needed
    runtime_engine: PHOTON
    # init_scripts:  # remove if not needed
    #   - dbfs:
    #       destination: dbfs:/<enter your path>

  basic-auto-scale-props: &basic-auto-scale-props
    autoscale:
      min_workers: 2
      max_workers: 4

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2

  basic-autoscale-cluster: &basic-autoscale-cluster
    new_cluster:
      <<: [*basic-cluster-props, *basic-auto-scale-props]  # merge base cluster props with the autoscale settings

environments:
  default:
    workflows:
      - name: "test-pipeline"
        job_clusters:
          - job_cluster_key: "basic-cluster"
            <<: *basic-static-cluster
          - job_cluster_key: "basic-autoscale-cluster"
            <<: *basic-autoscale-cluster
        tasks:
          - task_key: "spark-etl"
            job_cluster_key: "basic-cluster"
            python_wheel_task:
              package_name: "dbx_test"
              entry_point: "main"
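
For context, the python_wheel_task above only names the package (dbx_test) and an entry point (main). Below is a minimal sketch of how such an entry point is typically wired up in the wheel's packaging; the real project layout is not included in the issue, so the module path dbx_test.tasks.etl and the function entrypoint are assumptions modeled on the dbx Python templates, not the reporter's actual files.

# setup.py -- hypothetical minimal packaging for the dbx_test wheel
from setuptools import find_packages, setup

setup(
    name="dbx_test",
    version="0.1.0",
    packages=find_packages(exclude=["tests", "tests.*"]),
    entry_points={
        "console_scripts": [
            # python_wheel_task's entry_point "main" is resolved from this wheel metadata
            "main = dbx_test.tasks.etl:entrypoint",
        ]
    },
)

# dbx_test/tasks/etl.py -- hypothetical module the console script points at
from pyspark.sql import SparkSession

def entrypoint() -> None:
    spark = SparkSession.builder.getOrCreate()
    spark.range(10).show()  # placeholder for the actual ETL logic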

I have configured a job cluster for the actual job run, but during development I use an all-purpose cluster. To do that I go to the Workflows UI on Databricks and swap out the cluster. For the first run this works fine. But if I modify the code, redeploy, swap the cluster and run it again, the new wheel file is installed on the cluster without the earlier one being removed, and the run executes the older implementation instead of the new one with my changes. To run the new code I have to manually uninstall the wheel from the cluster and restart it; only then does it work.

Example screenshot of the libraries installed on the cluster: [screenshot not included]

Expected Behavior

When a new version of the job is deployed and launched on a cluster, the older wheel should be uninstalled.

Current Behavior

The new version of the wheel package is installed on the cluster without the older version being removed, and the run still uses the old wheel package.

Steps to Reproduce (for bugs)

1. Deploy the workflow with dbx deploy.
2. In the Workflows UI, swap the job cluster for an all-purpose cluster and run the job. The wheel is installed on the cluster and the run works as expected.
3. Modify the code, redeploy with dbx deploy, swap the cluster again, and run the job once more.

After this you will have multiple versions of the same wheel package installed on the cluster, and the run executes the old code until the old wheel is uninstalled and the cluster restarted.
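
Not from the original report, but one way to confirm which copy of the package a run actually resolves is to log the installed version and import location from inside the task. A small diagnostic sketch using only the standard library (the package name dbx_test is taken from the config above):

# Hypothetical diagnostic, not part of the reporter's job: show which installed
# copy of dbx_test the current run is using.
from importlib.metadata import version

import dbx_test

print("dbx_test version:", version("dbx_test"))
print("loaded from:", dbx_test.__file__)

If the printed path and version still point at the previously deployed wheel, the run is picking up the stale copy described above.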

Context

Your Environment

renardeinside commented 1 year ago

hi @tarique-msci , this is expected behavior on interactive (all-purpose) clusters. They're not recommended for job execution with wheels, as stated here: they don't support removing a wheel without restarting the cluster. If you want to make a development run on an all-purpose cluster, use dbx execute instead. Here is a doc for reference - https://dbx.readthedocs.io/en/latest/guides/python/python_quickstart/#executing-code-on-databricks