aabilov-dataminr opened this issue 4 months ago
Actually, workaround 2 does not work... I tried splitting the repo into two packages, but it seems like DABs cleans up the `dist` folder in between wheel builds 😞
Artifacts config:
```yaml
artifacts:
  llm-workflows-core:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_core --wheel --outdir dist
  llm-workflows-train:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_train --wheel --outdir dist
```
Deploy run:
```
databricks bundle deploy
Building llm-workflows-core...
Building llm-workflows-train...
Uploading dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.27.32-py3-none-any.whl...
Error: upload for llm-workflows-core failed, error: unable to read /Users/aabilov/git/dm-llm-workflows/dist/dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.27.32-py3-none-any.whl: no such file or directory
```
Expected behavior:
When specifying two artifacts to be built into the same dist folder, I would expect both of them to end up in that folder:
```
> python3 -m build llm_workflows_core --wheel --outdir dist
> python3 -m build llm_workflows_train --wheel --outdir dist
> ls dist
dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.30.10-py3-none-any.whl
dm_llm_workflows_train-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.30.15-py3-none-any.whl
```
Thanks for reporting the issue, @aabilov-dataminr.
We'll take a look at the cleanup of `dist` in between builds; that seems wrong. In the meantime, you could try having each build output to a different directory (one that won't be cleaned up).
As for proper extras support, we'll take a look as well. If this works at the API level, we should keep the extras suffix intact when we glob to find the wheel file.
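A minimal sketch of that interim suggestion, reusing the artifact config from the report above but with each build writing to its own output directory (directory names here are hypothetical) so that one build's cleanup can't remove the other's wheel:
```yaml
artifacts:
  llm-workflows-core:
    type: whl
    path: ./
    # hypothetical separate output dir
    build: python3 -m build llm_workflows_core --wheel --outdir dist_core
  llm-workflows-train:
    type: whl
    path: ./
    # hypothetical separate output dir
    build: python3 -m build llm_workflows_train --wheel --outdir dist_train
```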
@aabilov-dataminr the fix to support workaround 2 has been merged and released in version 0.224.1; please give it a try. In the meantime, I'm verifying whether the Databricks backend supports providing libraries with extras, and I'll keep this issue updated.
We use another workaround for installing extra dependencies in our integration tests: after specifying the wheel file(s) as a cluster dependency, we install the extra dependencies at runtime with a subprocess call.
The test resource config:
```yaml
targets:
  test:
    sync:
      include:
        - ../dist/*.whl
    resources:
      jobs:
        integration-test:
          name: integration-test
          tasks:
            - task_key: "main"
              spark_python_task:
                python_file: ${workspace.file_path}/tests/entrypoint.py
              libraries:
                - whl: ../dist/*.whl
              ...
          job_clusters:
            - job_cluster_key: test-cluster
              new_cluster:
                ...
                spark_env_vars:
                  DIST_FOLDER_PATH: ${workspace.file_path}/dist
                ...
```
And in databricks.yml:
```yaml
...
artifacts:
  default:
    type: whl
    path: .
...
```
And the entrypoint.py file:
```python
import os
import subprocess
import sys

if __name__ == "__main__":
    # no bytecode io
    sys.dont_write_bytecode = True

    # install extra dependencies, workaround for https://github.com/databricks/cli/issues/1602
    dist_folder = os.environ.get("DIST_FOLDER_PATH")
    if dist_folder is None:
        raise KeyError(
            "The env variable DIST_FOLDER_PATH is not set but is needed to run the tests."
        )
    # collect every wheel that was synced into the dist folder
    wheel_files = [
        os.path.join(dist_folder, f)
        for f in os.listdir(dist_folder)
        if f.endswith(".whl")
    ]
    # re-install each wheel with its [test] extras enabled
    for wheel_file in wheel_files:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", f"{wheel_file}[test]"]
        )
    ...
```
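A note on why this works: pip accepts an extras suffix on a local wheel path (e.g. `pip install './dist/foo-1.0-py3-none-any.whl[test]'`, filename hypothetical), installing the wheel together with the dependencies declared under its `test` extra.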
Describe the issue
When packaging a wheel in Python, it's standard practice to put some libraries in extras groups. This is commonly used in GPU/ML experimentation repositories to scope dependency groups to specific use cases or workflows.
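For illustration, a hypothetical pyproject.toml fragment declaring such extras groups (package and dependency names are made up; only the `train` and `test` extras mirror the ones used in this thread):
```toml
[project]
name = "my-package"       # hypothetical package name
version = "0.1.0"
dependencies = ["numpy"]  # base dependencies, always installed

[project.optional-dependencies]
train = ["torch"]   # installed only when requested as "my-package[train]"
test = ["pytest"]   # installed only when requested as "my-package[test]"
```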
When attempting to specify an extras group in the libraries config of a DABs project, the bundle build throws an error.
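A minimal sketch of the kind of `libraries` entry that triggers it, based on the path quoted under Actual Behavior below:
```yaml
tasks:
  - task_key: "main"
    libraries:
      # the bundle looks for a file literally matching ./dist/*.whl[train]
      # instead of stripping the extras suffix before globbing
      - whl: ./dist/*.whl[train]
```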
Hoping that this can be resolved! The only possible workarounds as of now are very destructive to standard Python packaging workflows.
If there are other/better workarounds, I'd love to hear them!
Configuration (shortened for brevity)
In pyproject.toml:
...
In databricks.yml (shortened for brevity):
...
Steps to reproduce the behavior
```
databricks bundle deploy
```
Expected Behavior
Instead of attempting to find a local file `./dist/*.whl[train]`, the bundle should correctly identify that `[train]` is an extras group and install the extras appropriately. This is standard behavior for Python wheels.
Actual Behavior
The bundle build fails because the wheel file can't be found.
OS and CLI version
OS X, Databricks CLI v0.219.0
Is this a regression?
No