databricks / cli

Unable to specify extras in a python wheel installation for Databricks Asset Bundles #1602

Open aabilov-dataminr opened 4 months ago

aabilov-dataminr commented 4 months ago

Describe the issue

When packaging a Python wheel, it's standard practice to put some libraries in extras groups. This is commonly used in GPU/ML experimentation repositories to scope dependency groups to specific use-cases or workflows.

When attempting to specify an extras group in the libraries config for a DABs project, the bundle build throws an error:

databricks bundle deploy                                  
Building llm-workflows...
Error: file dist/*.whl[train] is referenced in libraries section but doesn't exist on the local file system

Hoping that this can be resolved! The only possible workarounds as of now are very destructive to standard Python packaging workflows.

If there are other/better workarounds, I'd love to hear them!
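
For illustration, one destructive option of this kind (a sketch only, not necessarily one of the workarounds the author tried) is folding the extras into the main dependency list in pyproject.toml, so every consumer installs the heavy training stack:

dependencies = [
    "databricks-sdk>=0.29.0",
    "transformers==4.41.2"  # formerly in the [train] extras group
]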

Configuration (shortened for brevity)

In pyproject.toml:

dependencies = [
    "databricks-sdk>=0.29.0"
]

[project.optional-dependencies]
train = [
    "transformers==4.41.2"
]

In databricks.yml:

experimental:
  python_wheel_wrapper: true

artifacts:
  llm-workflows:
    type: whl
    path: ./
    build: python3 -m build . --wheel

# ...task config
      tasks:
        - task_key: "task"
          spark_python_task:
            python_file: "./llm_workflows/cli/generate.py"
          libraries:
            - whl: ./dist/*.whl[train]

Steps to reproduce the behavior

databricks bundle deploy

Expected Behavior

Instead of attempting to find a local file ./dist/*.whl[train], the bundle should correctly identify that [train] is an extras group and install the extras appropriately. This is standard behavior for Python wheels.
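
For reference, pip itself accepts an extras suffix on a concrete local wheel path (the filename below is illustrative; the quotes keep the shell from treating the brackets as a glob):

pip install "dist/llm_workflows-0.1.0-py3-none-any.whl[train]"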

Actual Behavior

Bundle build fails because the wheel file can't be found.

OS and CLI version

OS X, Databricks CLI v0.219.0

Is this a regression?

No

aabilov-dataminr commented 4 months ago

Actually, workaround 2 does not work... I tried splitting the repo into two packages, but it seems like DABs cleans up the dist folder in between wheel builds 😞

Artifacts config:

artifacts:
  llm-workflows-core:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_core --wheel --outdir dist
  llm-workflows-train:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_train --wheel --outdir dist

Deploy run:

databricks bundle deploy                                 
Building llm-workflows-core...
Building llm-workflows-train...
Uploading dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.27.32-py3-none-any.whl...
Error: upload for llm-workflows-core failed, error: unable to read /Users/aabilov/git/dm-llm-workflows/dist/dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.27.32-py3-none-any.whl: no such file or directory

Expected behavior: when specifying two artifacts that build into the same dist folder, I would expect both wheels to end up in that folder:

> python3 -m build llm_workflows_core --wheel --outdir dist
> python3 -m build llm_workflows_train --wheel --outdir dist
> ls dist                                                   
dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.30.10-py3-none-any.whl
dm_llm_workflows_train-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.30.15-py3-none-any.whl

pietern commented 4 months ago

Thanks for reporting the issue, @aabilov-dataminr.

We'll take a look at the cleanup of dist in between builds; that seems wrong. In the meantime, you could try having each build output to a different directory (which won't be cleaned up).
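
A sketch of that suggestion against the artifacts config above, with per-artifact output directories (dist_core and dist_train are hypothetical names):

artifacts:
  llm-workflows-core:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_core --wheel --outdir dist_core
  llm-workflows-train:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_train --wheel --outdir dist_train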

As for proper extras support, we'll take a look as well. If this works at the API level, we should keep the extras suffix intact when we glob to find the wheel file.
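
A minimal Python sketch of that globbing idea (illustrative only; the CLI itself is written in Go): split the extras suffix off the configured path before globbing, then reattach it to the resolved wheel.

import glob

def resolve_wheel(spec: str) -> str:
    # split an optional extras suffix: "dist/*.whl[train]" -> ("dist/*.whl", "[train]")
    pattern, extras = spec, ""
    if spec.endswith("]") and "[" in spec:
        pattern, _, rest = spec.rpartition("[")
        extras = "[" + rest
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"no wheel matches {pattern}")
    # keep the suffix intact so the library spec still carries the extras group
    return matches[0] + extras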

andrewnester commented 3 months ago

@aabilov-dataminr the fix to support workaround 2 has been merged and released in version 0.224.1, please give it a try. In the meantime, I'm verifying whether the Databricks backend supports providing libraries with extras; I'll keep this issue updated.

j-4 commented 2 months ago

We use another workaround for installing extra dependencies for our integration tests: after specifying the wheel file(s) as a cluster dependency, we install the extra dependencies at runtime with a subprocess call.

The test resource config:

targets:
  test:
    sync:
      include:
        - ../dist/*.whl
    resources: 
      jobs: 
        integration-test: 
          name: integration-test
          tasks:
            - task_key: "main"
              spark_python_task:
                python_file: ${workspace.file_path}/tests/entrypoint.py
              libraries:
                - whl: ../dist/*.whl
              ...
          job_clusters:
            - job_cluster_key: test-cluster
              new_cluster:
                ...
                spark_env_vars:
                  DIST_FOLDER_PATH: ${workspace.file_path}/dist
...

The databricks.yml:

...
artifacts:
  default:
    type: whl
    path: .
...

And the entrypoint.py file:

import os
import subprocess
import sys

if __name__ == "__main__":
    # don't write .pyc bytecode files
    sys.dont_write_bytecode = True
    # install extra dependencies at runtime, workaround for https://github.com/databricks/cli/issues/1602
    dist_folder = os.environ.get("DIST_FOLDER_PATH")
    if dist_folder is None:
        raise KeyError(
            "The env variable DIST_FOLDER_PATH is not set but is needed to run the tests."
        )
    wheel_files = [
        os.path.join(dist_folder, f)
        for f in os.listdir(dist_folder)
        if f.endswith(".whl")
    ]
    for wheel_file in wheel_files:
        # pip accepts an extras suffix on a concrete wheel path
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", f"{wheel_file}[test]"]
        )
...