databricks / databricks-asset-bundles-dais2023

pip reinstall is executed in every single task #8

Closed jmeidam closed 1 year ago

jmeidam commented 1 year ago

I am trying to convert a current dbx project to bundles. I have some tasks of type python_wheel_task.

One such task looks like this (they're all similar):

        - task_key: "data_raw"
          depends_on:
            - task_key: "process_init"
          job_cluster_key: "somejobcluster"
          python_wheel_task:
            package_name: "myproject"
            entry_point: "data_raw"
          libraries:
            - whl: ./dist/myproject-*.whl

and I have defined the following artifact:

    artifacts:
      the_wheel:
        type: whl
        path: .
        build: poetry build

In dbx, the wheel would be installed once on the job-cluster. Now I noticed that every task is converted to a notebook that contains the following code:

    %pip install --force-reinstall /Workspace/Shared/dbx/projects/myproject/.internal/.../myproject-0.0.0-py3-none-any.whl

This seems like a waste of run time when many small tasks run on the same cluster, since each one reinstalls the wheel.

Am I missing a setting, or is this done by design?

andrewnester commented 1 year ago

At the moment this is done by design. We're aware of the concerns and working on a path forward.

You can follow this issue for updates: https://github.com/databricks/cli/issues/783

As a data point, could you please share which Databricks Runtime version you are using on the clusters for your Python wheel jobs?

Thanks!

jmeidam commented 1 year ago

Hi Andrew, thanks for the link.

I am using 11.3.x-scala2.12

andrewnester commented 1 year ago

A change that addresses this issue was just released in CLI version 0.206.0, so feel free to give it a try. Since you're on runtime 11.3.x, you'll need to upgrade to DBR 13.2+, because the fix only applies there.

You can find more details here https://github.com/databricks/cli/pull/797
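
For illustration, the runtime upgrade might look roughly like the job cluster definition below. This is only a sketch under assumptions: the job name `myproject_job`, the `node_type_id`, and `num_workers` are hypothetical placeholders that do not come from this thread, and only `spark_version` is the change under discussion. The task and cluster keys are taken from the example earlier in the issue.

```yaml
resources:
  jobs:
    myproject_job:                          # hypothetical job name
      name: myproject_job
      job_clusters:
        - job_cluster_key: "somejobcluster"
          new_cluster:
            spark_version: "13.2.x-scala2.12"  # previously 11.3.x-scala2.12
            node_type_id: "Standard_DS3_v2"    # assumption: use your own node type
            num_workers: 1                     # assumption
      tasks:
        - task_key: "data_raw"
          job_cluster_key: "somejobcluster"
          python_wheel_task:
            package_name: "myproject"
            entry_point: "data_raw"
          libraries:
            - whl: ./dist/myproject-*.whl
```

After upgrading the CLI (check with `databricks --version`), redeploying with `databricks bundle deploy` should pick up the new behavior, assuming the rest of the bundle is unchanged.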

pietern commented 1 year ago

The linked PR has been merged and Python wheel tasks are no longer wrapped by a notebook by default.