databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

Python dependencies are missing (ModuleNotFoundError) when a pipeline is installed using DBX as python_wheel_task #850

Closed: msetkin closed this issue 1 year ago

msetkin commented 1 year ago

Expected Behavior

I expect that when I have a Python package with a setup.py file that manages dependencies and specifies an entry point function, and I deploy this entry point function as a python_wheel_task with dbx, it should run without import errors for the libraries that are specified in setup.py.

Current Behavior

My Python package has a setup.py file that manages dependencies and specifies an entry point function, which I deploy as a pipeline using a python_wheel_task with dbx. When I run the pipeline as a python_wheel_task, it raises ModuleNotFoundError: No module named 'mypy_boto3_s3', although the mypy-boto3-s3 dependency is declared in setup.py. However, when I deploy the same logic as a notebook that installs my package, imports it and calls the entry function, it finishes without this error.
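For reference, the notebook variant that does work looks roughly like the sketch below. This is only a minimal illustration, not our exact notebook: the DBFS path is a placeholder, and the "[base]" extra refers to the extras_require group from the setup.py shown further down.

```python
# Cell 1: install the wheel built from the package. The path is a placeholder;
# with recent pip versions, an extras group can be appended to a local wheel
# path, so "[base]" also pulls in the extras_require["base"] dependencies.
%pip install "/dbfs/FileStore/wheels/some_package-0.1.0-py3-none-any.whl[base]"

# Cell 2 (run after restarting Python, e.g. via dbutils.library.restartPython()):
# import the package and call the same entry point function the wheel task uses.
from some_package.some_module.common import entry_point

entry_point()
```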

Steps to Reproduce (for bugs)

  1. deployment.yml (the individual workflow definitions are omitted here):

```yaml
build:
  python: "pip"

environments:
  development:
    workflows:
```

  2. setup.py:

```python
from setuptools import find_packages, setup

PACKAGE_REQUIREMENTS = [
    "delta-spark==2.4.0",
    "pyspark==3.4.0",
    "mypy-boto3-s3==1.28.36",
]

setup(
    name="some_package",
    version="0.1.0",
    author="",
    author_email="m@.com",
    python_requires="==3.10.*",
    description="**",
    url="",
    packages=find_packages(
        include=["some_package", "some_package.*"],
        exclude=["tests", "tests.*", "workflows", "notebooks"],
    ),
    exclude_package_data={"": ["workflows/*", "notebooks/*"]},
    setup_requires=["setuptools", "wheel"],
    extras_require={"base": PACKAGE_REQUIREMENTS},
    entry_points={
        "console_scripts": [
            "some_entry_point = some_package.some_module.common:entry_point",
        ],
    },
)
```

  3. Deployment:

`dbx deploy job-name --environment=development --deployment-file=./conf/deployment.yml`

  4. Result of running:

ModuleNotFoundError: No module named 'mypy_boto3_s3'

```
ModuleNotFoundError                       Traceback (most recent call last)
File ~/.ipykernel/7667/command--1-2343336596:18
     15 entry = [ep for ep in metadata.distribution("some_package").entry_points if ep.name == "some_entry_point"]
     16 if entry:
     17     # Load and execute the entrypoint, assumes no parameters
---> 18     entry[0].load()()
     19 else:
     20     import dataprep_helper

File /usr/lib/python3.10/importlib/metadata/__init__.py:171, in EntryPoint.load(self)
    166 """Load the entry point from its definition. If only a module
    167 is indicated by the value, return that module. Otherwise,
    168 return the named object.
    169 """
    170 match = self.pattern.match(self.value)
--> 171 module = import_module(match.group('module'))
    172 attrs = filter(None, (match.group('attr') or '').split('.'))
    173 return functools.reduce(getattr, attrs, module)

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124     break
    125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)
File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)
File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)
File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)
File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)
File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/some_package/some_module/common.py:13
     11 import boto3
     12 from botocore.exceptions import ClientError, ParamValidationError
---> 13 from mypy_boto3_s3 import S3Client
     14 from pyspark.sql import DataFrame, SparkSession

ModuleNotFoundError: No module named 'mypy_boto3_s3'
```


My current assumption is that the installation step is skipped for some reason when a setup.py-structured package is deployed as a python_wheel_task.
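For completeness, the workflow definition in deployment.yml follows the usual dbx python_wheel_task layout. The sketch below is only illustrative (job name, cluster settings and node type are placeholders, not our actual configuration); the relevant part is that package_name and entry_point point at the wheel and the console_scripts entry from setup.py.

```yaml
environments:
  development:
    workflows:
      - name: "job-name"
        tasks:
          - task_key: "main"
            python_wheel_task:
              package_name: "some_package"
              entry_point: "some_entry_point"
            new_cluster:
              spark_version: "13.3.x-scala2.12"
              node_type_id: "i3.xlarge"   # placeholder instance type
              num_workers: 1
```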

## Context

## Your Environment

* dbx version used: 0.8.18
* Databricks Runtime version: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
zachcacciatore-8451 commented 1 year ago

bump - having same issue

msetkin commented 1 year ago

After debugging and having a few discussions with Databricks Support, I came to the following findings:

  1. As one can see from the structure of our setup.py, we do not use the install_requires parameter inside the setup() call, only extras_require. The idea, inspired by several sources (for example, here), was to be as flexible as possible in choosing which packages to install: in the extreme case, when no extras are passed to pip, no additional libraries are installed at all.
  2. While this approach works perfectly locally with plain pip, the "magic pip" (%pip) implementation, or whatever installation procedure runs on the Databricks side, seems to ignore extras and correctly processes only the libraries listed in install_requires. After rearranging setup.py accordingly (a sketch follows this list), the error is gone. My takeaways would be:
    • install_requires should contain only the libraries that are absent from the Databricks Runtime (for example, this is the list of libraries present in 13.3), i.e. anything the app needs that is not already on the cluster. There is also no need to specify pyspark and delta-spark here, because they are present on DBR anyhow, even though they are not mentioned in that list. The dependencies installed on the Databricks side should not rely on optional/extra identifiers.
    • extras_require can be structured in a more fine-grained way, in particular for local development, testing, linting, build tools, etc. pyspark and delta-spark should also go into one of these optional extras rather than into install_requires, since they are already installed on DBR (although not mentioned directly in that list). A local installation can then include any combination of optional extras, e.g. pip install -e .[local_dev,build,linting,testing].
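A sketch of the rearranged setup.py along these lines (the extra group names and the test dependency are illustrative, not our exact file):

```python
from setuptools import find_packages, setup

# Only libraries that are NOT shipped with the Databricks Runtime go here;
# these get installed on the cluster together with the wheel.
INSTALL_REQUIRES = [
    "mypy-boto3-s3==1.28.36",
]

# Optional groups for local work only. pyspark/delta-spark live here because
# DBR already provides them on the cluster.
EXTRAS_REQUIRE = {
    "local_dev": ["pyspark==3.4.0", "delta-spark==2.4.0"],
    "testing": ["pytest"],
}

setup(
    name="some_package",
    version="0.1.0",
    packages=find_packages(include=["some_package", "some_package.*"]),
    install_requires=INSTALL_REQUIRES,
    extras_require=EXTRAS_REQUIRE,
    entry_points={
        "console_scripts": [
            "some_entry_point = some_package.some_module.common:entry_point",
        ],
    },
)
```

Locally one can then run, for example, `pip install -e ".[local_dev,testing]"`, while the wheel installed for the python_wheel_task only needs what is listed in install_requires.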

Having said this, I think I can now close the bug, since the workaround described above is found and working. The only takeaway for Databricks may be that they might want to align the magic pip (%pip) interface with the specification of pip itself, including support for optional extras. But that could be tracked as a separate issue and is not critical for us at the moment, since we have a workaround.