bump - having same issue
After debugging and having a few discussions with the Support of Databricks, I got the following findings:

1. Initially my setup.py did not have the install_requires parameter inside the setup() call, only the extras_require. The idea behind this was inspired by many sources, for example, here, and it was to be as flexible as possible in choosing which packages to install; in the extreme case, when no extras are passed to pip, no additional libraries have to be installed at all.
2. I moved the mandatory dependencies into install_requires. After rearranging the structure of setup.py in this way the error is gone, therefore my takeaways would be:
   - install_requires should contain only the libraries that are absent from the Databricks Runtime (for example, this is the list of the libraries that are present for 13.3), i.e. anything that is needed for the app but not mentioned in the link above. There is also no need to specify pyspark and delta-spark here, because they are present on DBR anyhow, although not mentioned in the link above. The installation on the Databricks side should not contain optional/extra "identifiers".
   - extras_require may be structured in a more detailed way, in particular for local development, testing, linting, build tools etc. pyspark/delta-spark should also go into one of the optional extras inside extras_require, and not into install_requires, since they are installed on DBR (although not mentioned directly). The local installation may include any optional "extras", e.g. pip install -e .[local_dev,build,linting,testing] or combinations, etc. (see the setup.py sketch below, after this comment).

Having said this, I think I may now close the bug, as the above-mentioned workaround is found and working. The only takeaway for Databricks may be that they might want to unify the magic pip (%pip) interface so that it fully supports the specification of pip itself, including support for optional extras. But this could be done as a separate issue, and it is not critical for us at the moment since we have a workaround.
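To make the takeaways concrete, here is a minimal sketch of the rearranged setup.py. The exact names of the extras and the testing entry are illustrative, not a copy of my real file:

# Sketch of the rearranged setup.py following the takeaways above.
from setuptools import find_packages, setup

setup(
    name="some_package",
    version="0.1.0",
    python_requires="==3.10.*",
    packages=find_packages(include=["some_package", "some_package.*"]),
    # Only libraries that are absent from the Databricks Runtime:
    install_requires=[
        "mypy-boto3-s3==1.28.36",
    ],
    # Everything already present on DBR (pyspark, delta-spark) and the
    # development tooling stays behind optional extras:
    extras_require={
        "local_dev": ["pyspark==3.4.0", "delta-spark==2.4.0"],
        "testing": ["pytest"],
    },
    entry_points={
        "console_scripts": [
            "some_entry_point = some_package.some_module.common:entry_point",
        ],
    },
)

With this layout, the wheel installed by the python_wheel_task pulls in mypy-boto3-s3 automatically, while pip install -e .[local_dev,testing] still gives a complete local environment.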
Expected Behavior
I expect that when I have a Python package with a "setup.py" file to manage dependencies and to specify an entry point function, and I deploy this entry point function as a python_wheel_task with DBX, it should run without import errors for libraries that are specified in setup.py.
Current Behavior
My Python package has a "setup.py" file to manage dependencies and to specify an entry point function, which I use to deploy a pipeline as a python_wheel_task with DBX. But when I run the pipeline as a python_wheel_task, it causes an exception:
ModuleNotFoundError: No module named 'mypy_boto3_s3'
although the mypy-boto3-s3 dependency is provided in setup.py. However, when I deploy the same logic as a notebook, where I install my package, import it and call the entry function, it finishes without this error.

Steps to Reproduce (for bugs)
environments:
  development:
    workflows:
# -*- coding: utf-8 -*-
from setuptools import find_packages, setup

PACKAGE_REQUIREMENTS = [
    "delta-spark==2.4.0",
    "pyspark==3.4.0",
    "mypy-boto3-s3==1.28.36",
]

setup(
    name="some_package",
    version="0.1.0",
    author="",
    author_email="m@.com",
    python_requires="==3.10.*",
    description="**",
    url="",
    packages=find_packages(
        include=["some_package", "some_package.*"],
        exclude=["tests", "tests.*", "workflows", "notebooks"],
    ),
    exclude_package_data={"": ["workflows/*", "notebooks/*"]},
    setup_requires=["setuptools", "wheel"],
    extras_require={"base": PACKAGE_REQUIREMENTS},
    entry_points={
        "console_scripts": [
            "some_entry_point = some_package.some_module.common:entry_point",
        ],
    },
)
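(For context, tying back to the findings at the top of the thread: with this setup.py every requirement sits behind the "base" extra, so installing the built wheel without extras, which is effectively what the python_wheel_task does, pulls in none of PACKAGE_REQUIREMENTS. A quick, illustrative way to confirm this on the cluster is to inspect the installed distribution's metadata:)

import importlib.metadata

# Each requirement carries a marker along the lines of
#   mypy-boto3-s3==1.28.36; extra == "base"
# so a plain install of the wheel (no extras) skips all of them.
for req in importlib.metadata.requires("some_package") or []:
    print(req)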
ModuleNotFoundError: No module named 'mypy_boto3_s3'
ModuleNotFoundError                       Traceback (most recent call last)
File ~/.ipykernel/7667/command--1-2343336596:18
     15 entry = [ep for ep in metadata.distribution("some_package").entry_points if ep.name == "some_entry_point"]
     16 if entry:
     17     # Load and execute the entrypoint, assumes no parameters
---> 18     entry[0].load()()
     19 else:
     20     import dataprep_helper

File /usr/lib/python3.10/importlib/metadata/__init__.py:171, in EntryPoint.load(self)
    166 """Load the entry point from its definition. If only a module
    167 is indicated by the value, return that module. Otherwise,
    168 return the named object.
    169 """
    170 match = self.pattern.match(self.value)
--> 171 module = import_module(match.group('module'))
    172 attrs = filter(None, (match.group('attr') or '').split('.'))
    173 return functools.reduce(getattr, attrs, module)

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124         break
    125     level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)
File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)
File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)
File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)
File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)
File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/some_package/some_module/common.py:13
     11 import boto3
     12 from botocore.exceptions import ClientError, ParamValidationError
---> 13 from mypy_boto3_s3 import S3Client
     14 from pyspark.sql import DataFrame, SparkSession

ModuleNotFoundError: No module named 'mypy_boto3_s3'
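(For context, the wrapper in the first traceback frame resolves and calls the console-script entry point roughly like this; a sketch reconstructed from that frame, not Databricks' actual code:)

from importlib import metadata

# Find the console-script entry point declared in setup.py and invoke it.
# load() imports some_package.some_module.common, which is where the
# ModuleNotFoundError for mypy_boto3_s3 is raised; the trailing () would
# then call entry_point().
entry = [ep for ep in metadata.distribution("some_package").entry_points
         if ep.name == "some_entry_point"]
if entry:
    entry[0].load()()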