databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

No Recommended Way to Specify Pinned Dependencies #483

Open pspeter opened 1 year ago

pspeter commented 1 year ago

Hello,

since support for requirements files was deprecated, there is AFAIK no way to use pinned dependencies anymore, except by putting them all into the deployment file. As I already explained in #424, I currently use pip-tools to pin my project dependencies and separately pin my test/lint/etc. dependencies. The currently recommended way to separate test/lint/prod requirements is to use setuptools' or poetry's extras. However, these methods do not offer a way to create (and use) pinned dependencies separately from non-pinned dependencies.

Still, it would be nice to be able to use pinned dependencies for reproducible builds. For pip-based projects those are usually found in requirements files (sometimes multiple), and for poetry-based projects dependency groups might be more fitting for this purpose than extras (which are actually meant for user-facing optional dependencies for optional features of libraries, though they are often (ab)used for test/dev requirements as well).

I know it is hard to support the many ways people manage and build python packages, but it would be nice to have at least one way to pin our dependencies separately from the unpinned ones.

Thanks for your work and the awesome new docs by the way 🔥

tonkolviktor commented 1 year ago

Hi @pspeter (I'm not from dbx), to solve this problem we came up with two non-dbx alternatives:

* init_script
* custom Docker image

The first feels hacky but is fairly easy; the second has more advantages, e.g. a faster job start.

copdips commented 1 year ago

@pspeter, I'm not from Databricks either, but I would like to understand your question. Could you please provide a concrete example of what doesn't work? Since setup.py is supported by dbx, I still cannot figure out why you say you cannot use pinned dependencies.

pspeter commented 1 year ago

> Hi @pspeter (I'm not from dbx), to solve this problem we came up with two non-dbx alternatives:
>
> * init_script
> * custom Docker image
>
> The first feels hacky but is fairly easy; the second has more advantages, e.g. a faster job start.

Hey @tonkolviktor, thanks for your reply. Could you expand a bit on the custom docker image? How do you deploy it? Do you use dbx at all with it, or is it a completely separate, custom solution?

> @pspeter, I'm not from Databricks either, but I would like to understand your question. Could you please provide a concrete example of what doesn't work? Since setup.py is supported by dbx, I still cannot figure out why you say you cannot use pinned dependencies.

@copdips, while you are right that you can pin your dependencies in setup.py, it is not common to do it there. It's more common to have both unpinned and pinned dependencies, and setup.py/pyproject.toml is the place for the unpinned ones. For the pinned ones there are, e.g., requirements files or poetry's lock files. This separation makes it easy to upgrade your pinned versions with a single command like pip-compile --upgrade or poetry update. If you pin everything in setup.py, including all your sub-dependencies, I don't know of a tool that would help you update all of these to the latest versions.

copdips commented 1 year ago

Do you mean you want to use requirements files with setup.py? If yes, since setup.py is a Python file, it can easily parse existing requirements files.

pspeter commented 1 year ago

No, I want to have my unpinned requirements in setup.py (or pyproject.toml), and my pinned requirements in a separate requirements.txt. The pinned requirements are generated automatically from the unpinned ones. Please have a look at the pip-tools Readme for an overview of this workflow.

tonkolviktor commented 1 year ago

So we are still setting it up, but this is the official documentation: https://docs.databricks.com/clusters/custom-containers.html We use CI to build and push the image (no dbx/databricks there), then we configure the job via dbx to use it through the docker_image tag: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs
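For reference, the docker_image block in a job's cluster spec looks roughly like the sketch below (the registry URL, node type and credentials are placeholders, not values from this thread):

```python
# Sketch of the cluster part of a job definition that points at a custom image.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",  # placeholder node type
    "num_workers": 1,
    "docker_image": {
        "url": "myregistry.azurecr.io/my-dbx-job:latest",  # hypothetical image
        "basic_auth": {"username": "<user>", "password": "<token>"},
    },
}
```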

copdips commented 1 year ago

> No, I want to have my unpinned requirements in setup.py (or pyproject.toml), and my pinned requirements in a separate requirements.txt. The pinned requirements are generated automatically from the unpinned ones. Please have a look at the pip-tools Readme for an overview of this workflow.

Thanks for the link, but I'm really sorry, I still cannot figure out why it doesn't work for you, because for me, we can do almost anything in setup.py (not the case for pyproject.toml). Could you please provide an example? A simplified folder tree, the content of setup.py, the simplified content of requirements.in and a simplified requirements.txt, etc.?

pspeter commented 1 year ago

@copdips

setup.py:

```python
install_requires=["pandas>=1.0"]
```

requirements.txt (generated with pip-compile):

```
numpy==1.23.2
    # via pandas
pandas==1.2.5
    # via project-name
python-dateutil==2.8.2
    # via pandas
pytz==2022.2.1
    # via pandas
```

In the CI pipeline, the job should be built using the pinned requirements in requirements.txt, but I can update them at any time using pip-compile --upgrade. This way, I can do controlled updates that are tested in our CI system with unit and integration tests. If you don't pin the dependencies, you might get different versions of packages between your tests and prod, which may break your code in unexpected ways.

But if you pin them directly in setup.py, it becomes hard to upgrade; you have to do everything by hand. You also might miss pinning transitive dependencies (e.g. numpy, python-dateutil and pytz in the example above).
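To see concretely how much a hand-pinned setup.py would have to cover, here is a small sketch (assuming the packaging library is available and pandas is installed locally) that walks the installed metadata and collects the transitive dependencies:

```python
from importlib.metadata import PackageNotFoundError, requires
from packaging.requirements import Requirement

def transitive_deps(name, seen=None):
    """Recursively collect the installed (non-extra) dependencies of a package."""
    seen = set() if seen is None else seen
    try:
        specs = requires(name) or []
    except PackageNotFoundError:
        return seen
    for spec in specs:
        req = Requirement(spec)
        # Skip requirements that only apply under an extra, e.g. 'extra == "test"'.
        if req.marker and not req.marker.evaluate({"extra": ""}):
            continue
        if req.name not in seen:
            seen.add(req.name)
            transitive_deps(req.name, seen)
    return seen

print(sorted(transitive_deps("pandas")))  # e.g. ['numpy', 'python-dateutil', 'pytz', 'six']
```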

copdips commented 1 year ago

If I understand correctly, you have two different places to declare the dependencies, right? If so, and if the requirements.txt file is tracked by git, what about letting setup.py read requirements.txt like this:

from pkg_resources import parse_requirements
from setuptools import setup

with open("requirements.txt", encoding="utf-8") as req_file:
    install_requires = [str(req) for req in parse_requirements(req_file)]

setup(
    ...
    install_requires=install_requires,
    ...
)
pspeter commented 1 year ago

I don't want setup.py to read requirements.txt; they are separate and serve different functions. Please re-read my comments above and the pip-compile docs.

copdips commented 1 year ago

OK, what about adding an if/else block in setup.py? I prefer a single entry point for dependency management. Roughly:

```python
import os
from pkg_resources import parse_requirements

if os.getenv("CI"):  # assumes the CI system sets a CI environment variable
    with open("requirements.txt") as req_file:
        install_requires = [str(req) for req in parse_requirements(req_file)]
else:
    install_requires = ["pandas>=1.0"]
```
pspeter commented 1 year ago

That should work I suppose, but it's a bit hacky. dbx should support a better/more commonly used solution imho.

renardeinside commented 1 year ago

Hi @pspeter, I have a brief outline of this concept. Idea: what if we introduce new build functionality, e.g. a dbx build command with the following behaviour:

dbx build --groups=dev,test 
# OR
dbx build --extras=dev,test

It would reference only the base plus the requested groups or packages in the resulting whl file. Would this improve your devExp? What kind of options might you consider for such a command?

pspeter commented 1 year ago

Hi @renardeinside! Thanks for your response.

I struggle to understand what the build command would do other than what dbx deploy already does. What I am looking for is a way to use pinned dependencies in the deployment. Traditionally, these are not included in setup.py at all but in a separate file (requirements.txt for pip and poetry.lock for poetry). I'll give you an example why this is needed:

Say my job uses fancy-ml-package v1.5 and I am happy with it. But the developers of that package release a new version v2.0 with new features and some breaking changes. This causes my code to break and I have to make some changes before I update to 2.0. If I just specify "fancy-ml-package" as requirement in my setup.py, my CI will suddenly freak out with failed tests once 2.0 is released.

An easy fix would be to specify "fancy-ml-package<2.0,>=1.5" in my setup.py. However, my code might even break when a dependency of fancy-ml-package gets updated, one that I don't even use in my code. For example, if fancy-ml-package uses fancy-data-package, an update 2.0 to the latter might break the former, which in turn might break my pipeline.

Now to fix this, I would also have to add fancy-data-package<2.0 to my setup.py, even though I don't even use that package myself!

To fix this problem once and for all, you can instead pin your dependencies and sub-dependencies to specific versions. pip-compile and poetry install do exactly that. But, they don't overwrite your dependencies in setup.py. Therefore, it is easy to later update your pinned dependencies to the latest versions in a controlled fashion (i.e. in a PR with CI/CD) just by calling pip-compile --upgrade or poetry update.

However, a tool like dbx needs to be able to use these pinned dependencies when installing the package. I can't tell you how to do that since I'm not very familiar with dbx's code base. But right now in my projects, I have to read the requirements file in my setup.py to get dbx to use them, which is not convenient.

So to wrap up, instead of dbx build, I would like dbx deploy to use my requirements-*.txt files or my poetry.lock files. Ideally, I want to be able to specify a list of these files in the deployment.yml in the python_wheel_task section so I don't have to repeat it in every CLI command. It usually wouldn't change for a given job anyway.

If you have further questions, just ask! Thanks again for your work!

spott commented 1 year ago

This feature would be great to have. Supporting poetry lockfiles seems like the most logical way to do this (looking at https://dbx.readthedocs.io/en/latest/guides/general/dependency_management/, I was under the impression this was already being done, given your reference to poetry), but supporting pipenv lock files, requirements.txt or even nix flakes would work as well.

The value of this kind of dependency pinning is that it allows you to be much more confident that if it works in test it will work in production, as the environment is exactly* the same.

As an example of why this is important, pandas 1.5.1 changed the behavior of groupby on categorical dtypes from what it was in 1.5.0 (which was different from what it was in 1.4.x). A pandas install on a system that already has 1.5.0 is going to use that version, while in production on a system that is built from scratch, the version might end up being 1.5.1. Thus, this change could introduce a (very hard to track down) bug that only shows up in production.

Poetry, pip-tools, Pipenv and nix attempt to fix this problem by introducing the concept of lock files. A lock file is separate from the requirements declared for the package and records the exact version of every python dependency installed in that environment. It typically includes not just specific pinned versions of each dependency and sub-dependency, but hashes of those packages as well. This means that when I want to run my python code elsewhere, I can use this precise definition of the package versions that should be installed in the environment to be sure that the code behaves the same. This separation between lock files and package requirements (like in setup.py) allows me to update the lock file with new versions of packages when they come out (and then retest and check that everything works).

Libraries (rather than applications) can't do this kind of version pinning, as it would completely prevent dependency resolution of anything that depended on the library. This is why setup.py and pyproject.toml typically don't have pinned dependencies: they are tools for the distribution of libraries, not applications. The robust way to deal with this for libraries is to use something like tox to run tests on every combination of dependency versions available to ensure that the package still works. In practice this is impossible for any reasonably sized package, so shortcuts are taken, but still some combinations need to be checked. Applications, however, can take advantage of this version pinning to ensure that they are run with a known set of dependencies to prevent surprises in production.

Ideally this behavior would be possible with Databricks notebooks -- so notebook scoped python dependencies could ensure that every time that notebook is run, from now till forever, it behaves the same. Think about how this would fix scheduled jobs that have dependencies that change between runs: you run the job January of this year, and then again January of next year. In that time, it is likely that every python dependency that job depends on has changed and it is unlikely it will work exactly the same without modification. With lock files, it would have a much better chance of working without being edited.

* python lockfiles only pin python packages; underlying C libraries (such as LAPACK or BLAS) can change and thus change the environment... nix attempts to fix that by locking not just python packages but everything in the environment, down to the compiler.

renardeinside commented 1 year ago

Thanks a lot for your opinions @spott and @pspeter !

I think there are two separate problems that should be solved separately:

  1. Local environment management and sync
  2. Package management and sync.

I.e., issues with C-based libraries and the JVM are part of local env management, whilst package management (e.g. adding new dependencies without breaking the existing package) is related but not dependent on it.

As for local environment management, I assume the best is to introduce a Docker dev environment based on the Databricks runtime containers. This will allow users to operate in a portable environment that is not dependent on Win/Mac/Linux issues, and the only local requirement is just to have docker and an IDE that supports it. Most popular IDEs (e.g. VS Code and Intellij products) fully support the devcontainer based approach.

Therefore this is a change in the default template, not a change in the dbx functionality.

As for (2), things are getting slightly harder, since in the dbx project we indeed won't be able to implement a fully-functional Python dependency resolution system. Instead, we can stand on the shoulders of giants and use poetry, a very effective tool for that. Now to the topic of dbx build and how it affects the user interface.

The biggest issue with the current dbx approach is that the prepared whl package only supports installing extras (i.e. package_name[dev,local]) as a part of dbx execute. However, in some cases, e.g. unit tests, it makes sense to be able to perform the same installation during an automated launch. The issue is that the Library interface for job clusters doesn't really support providing extras, e.g. something like:

"libraries": [
  {
    "whl": "dbfs:/some/versioned/path/dbx_azure_devops_demo-0.0.1-py3-none-any.whl[EXTRA1,EXTRA2]"
  }
]

won't work. Instead, we can introduce new build logic that allows building a package with a custom choice of extras (or management groups, as they're now called in poetry).

By using this approach you'll be able to specify the dependencies in a custom way, e.g. a separate dependency stack for dbx execute, a separate stack for dbx deploy && dbx launch for integration tests, and a separate stack for the prod launch.

Does the above make sense, or do you think this will only complicate the setup?

spott commented 1 year ago

Docker is actually a fantastic solution, but maybe not in the way you are thinking.

A docker build for local and remote execution would allow my local environment and my production environment to be identical, which is perfect. A docker build for local development only doesn't solve the core problem of dependency resolution.

From what it sounds like, Databricks doesn't have a method of installing a set of specific package and versions for a job: wheels are provided and then pip is used to resolve the dependencies of those wheels from what is given by the wheels metadata/setup.py.

Unfortunately, within this framework, ensuring that Databricks is using pinned dependencies is likely impossible (fixing dependencies by pinning them in the setup.py probably will break something unless you are installing the package in a brand new empty virtualenv), so this might be a problem for Databricks upstream.

The extras solution is a hack to try to deal with this, and it likely won't work. Generally speaking, you can't have dependencies change when extras are included (you can add dependencies, but not do something like pandas==1.5.1 under regular dependencies and pandas==1.5.0 under an extra).

This is also complicated by the fact that management groups in poetry are not the same as extras; they are separate concepts. Management groups are a poetry-specific way of separating out dependencies by topic, but in order to do dependency resolution, poetry takes the superset of all management groups into account (so if you have a dev and a prod management group, all packages in both are included when the package is locked, and it will fail to lock if they conflict). Extras are package extras that pip understands when doing installations (pip install mypackage[myextra], for example). I believe that extras also must all be resolvable at the same time as far as poetry is concerned: when dependency locking is done, poetry takes the superset that includes all dependencies in all management groups and all extras and finds what versions of all those dependencies work together. So you can't have the "production" management group define a pinned set of packages while the "development" management group doesn't pin those dependencies.

Docker might be the best solution right now if we can upload the Dockerfile that we are using locally. So I could create a Dockerfile based on one of the DBR Dockerfiles and use poetry to install all dependencies when I install my package. Though I think this is going to run into issues, as Databricks isn't currently set up to run an application in docker...

Something like this might work: https://bneijt.nl/pr/poetry-lock-package/. It allows you to build the wheel for distribution with locked dependencies, while you rely on the lock file while developing. The trick here is that this only works reliably if it is installed into an empty virtualenv. If the virtualenv already has some other dependencies, then the pip install is likely to fail. It might need to be played with to ensure it doesn't conflict with the packages in the Databricks DBR.

pspeter commented 1 year ago

Hey @renardeinside,

Ad 1): I agree that a docker environment is probably the ideal way to set up a robust Databricks dev environment. To be honest, I have little experience with docker dev environments, but I would be up to help out as much as I can (at least with testing).

And for 2), I'm not opposed to switching to poetry if that is what dbx chooses to support. I think it's a sensible option from today's standpoint. I'm still not sure about dbx build though. Maybe you can give me an example of how it fits into the typical dev workflow? For me, I currently have the following in mind (sorry for the lengthy example, it's as small as I could get it):

1) Create a project with poetry init/new and dbx init and add your dependencies with poetry add, which creates a poetry.lock file (using the current python template).
2) Add optional dependency groups for test/lint/local along with their dependencies.

Now, the pyproject.toml looks something like this (ignoring delta and so on for brevity):

```toml
[tool.poetry]
name = "poetry-test"
version = "0.1.0"
description = ""
authors = []
readme = "README.md"
packages = [{include = "poetry_test"}]

[tool.poetry.dependencies]
python = "^3.9"
numpy = "^1.23.3"

[tool.poetry.group.test]
optional = true

[tool.poetry.group.test.dependencies]
pytest = "^7.1.3"

[tool.poetry.group.local]
optional = true

[tool.poetry.group.local.dependencies]
pyspark = "^3.3.0"

[tool.poetry.group.lint]
optional = true

[tool.poetry.group.lint.dependencies]
black = "^22.10.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```
and `poetry.lock` like this: ```toml [[package]] name = "attrs" version = "22.1.0" description = "Classes Without Boilerplate" category = "dev" optional = false python-versions = ">=3.5" [package.extras] dev = ["cloudpickle", "coverage[toml] (>=5.0.2)", "furo", "hypothesis", "mypy (>=0.900,!=0.940)", "pre-commit", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins", "sphinx", "sphinx-notfound-page", "zope.interface"] docs = ["furo", "sphinx", "sphinx-notfound-page", "zope.interface"] tests = ["cloudpickle", "coverage[toml] (>=5.0.2)", "hypothesis", "mypy (>=0.900,!=0.940)", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins", "zope.interface"] tests_no_zope = ["cloudpickle", "coverage[toml] (>=5.0.2)", "hypothesis", "mypy (>=0.900,!=0.940)", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins"] [[package]] name = "black" version = "22.10.0" description = "The uncompromising code formatter." category = "dev" optional = false python-versions = ">=3.7" [package.dependencies] click = ">=8.0.0" mypy-extensions = ">=0.4.3" pathspec = ">=0.9.0" platformdirs = ">=2" tomli = {version = ">=1.1.0", markers = "python_full_version < \"3.11.0a7\""} typing-extensions = {version = ">=3.10.0.0", markers = "python_version < \"3.10\""} [package.extras] colorama = ["colorama (>=0.4.3)"] d = ["aiohttp (>=3.7.4)"] jupyter = ["ipython (>=7.8.0)", "tokenize-rt (>=3.2.0)"] uvloop = ["uvloop (>=0.15.2)"] [[package]] name = "click" version = "8.1.3" description = "Composable command line interface toolkit" category = "dev" optional = false python-versions = ">=3.7" [package.dependencies] colorama = {version = "*", markers = "platform_system == \"Windows\""} [[package]] name = "colorama" version = "0.4.5" description = "Cross-platform colored terminal text." category = "dev" optional = false python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" [[package]] name = "iniconfig" version = "1.1.1" description = "iniconfig: brain-dead simple config-ini parsing" category = "dev" optional = false python-versions = "*" [[package]] name = "mypy-extensions" version = "0.4.3" description = "Experimental type system extensions for programs checked with the mypy typechecker." category = "dev" optional = false python-versions = "*" [[package]] name = "numpy" version = "1.23.3" description = "NumPy is the fundamental package for array computing with Python." category = "main" optional = false python-versions = ">=3.8" [[package]] name = "packaging" version = "21.3" description = "Core utilities for Python packages" category = "dev" optional = false python-versions = ">=3.6" [package.dependencies] pyparsing = ">=2.0.2,<3.0.5 || >3.0.5" [[package]] name = "pathspec" version = "0.10.2" description = "Utility library for gitignore style pattern matching of file paths." category = "dev" optional = false python-versions = ">=3.7" [[package]] name = "platformdirs" version = "2.5.4" description = "A small Python package for determining appropriate platform-specific dirs, e.g. a \"user data dir\"." 
category = "dev" optional = false python-versions = ">=3.7" [package.extras] docs = ["furo (>=2022.9.29)", "proselint (>=0.13)", "sphinx (>=5.3)", "sphinx-autodoc-typehints (>=1.19.4)"] test = ["appdirs (==1.4.4)", "pytest (>=7.2)", "pytest-cov (>=4)", "pytest-mock (>=3.10)"] [[package]] name = "pluggy" version = "1.0.0" description = "plugin and hook calling mechanisms for python" category = "dev" optional = false python-versions = ">=3.6" [package.extras] dev = ["pre-commit", "tox"] testing = ["pytest", "pytest-benchmark"] [[package]] name = "py" version = "1.11.0" description = "library with cross-python path, ini-parsing, io, code, log facilities" category = "dev" optional = false python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" [[package]] name = "py4j" version = "0.10.9.5" description = "Enables Python programs to dynamically access arbitrary Java objects" category = "dev" optional = false python-versions = "*" [[package]] name = "pyparsing" version = "3.0.9" description = "pyparsing module - Classes and methods to define and execute parsing grammars" category = "dev" optional = false python-versions = ">=3.6.8" [package.extras] diagrams = ["jinja2", "railroad-diagrams"] [[package]] name = "pyspark" version = "3.3.0" description = "Apache Spark Python API" category = "dev" optional = false python-versions = ">=3.7" [package.dependencies] py4j = "0.10.9.5" [package.extras] ml = ["numpy (>=1.15)"] mllib = ["numpy (>=1.15)"] pandas_on_spark = ["numpy (>=1.15)", "pandas (>=1.0.5)", "pyarrow (>=1.0.0)"] sql = ["pandas (>=1.0.5)", "pyarrow (>=1.0.0)"] [[package]] name = "pytest" version = "7.1.3" description = "pytest: simple powerful testing with Python" category = "dev" optional = false python-versions = ">=3.7" [package.dependencies] attrs = ">=19.2.0" colorama = {version = "*", markers = "sys_platform == \"win32\""} iniconfig = "*" packaging = "*" pluggy = ">=0.12,<2.0" py = ">=1.8.2" tomli = ">=1.0.0" [package.extras] testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "xmlschema"] [[package]] name = "tomli" version = "2.0.1" description = "A lil' TOML parser" category = "dev" optional = false python-versions = ">=3.7" [[package]] name = "typing-extensions" version = "4.4.0" description = "Backported and Experimental Type Hints for Python 3.7+" category = "dev" optional = false python-versions = ">=3.7" [metadata] lock-version = "1.1" python-versions = "^3.9" content-hash = "18e55cc17fb37078f7fa94b834d863f49336ccb747c87286ba9bf7b0eef44dac" [metadata.files] attrs = [ {file = "attrs-22.1.0-py2.py3-none-any.whl", hash = "sha256:86efa402f67bf2df34f51a335487cf46b1ec130d02b8d39fd248abfd30da551c"}, {file = "attrs-22.1.0.tar.gz", hash = "sha256:29adc2665447e5191d0e7c568fde78b21f9672d344281d0c6e1ab085429b22b6"}, ] [... SHAs cut out for brevity ...] ```

3) In the deployment.yml, create a python_wheel_task for the main job, and a spark_python_task for the integration test, including the necessary group(s) as a new argument:

custom:
  some-cluster-config: &some-cluster-config
    # omitted

environments:
  default:
    strict_path_adjustment_policy: true
    workflows:
      - name: "poetry-test-job"  # main entry point
        <<: *some-cluster-config  # omitted
        python_wheel_task:
          package_name: "poetry_test"
          entry_point: "poetry_test"
          with_dependency_groups: []  # default is empty list, just installs main group
          parameters: [ "foo" ]
      - name: "poetry-test-integration-test"
        tasks:
          - task_key: "main"
            <<: *some-cluster-config  # omitted
            spark_python_task:
              python_file: "file://tests/entrypoint.py"
              with_dependency_groups: ["test"]
              parameters: [
                "file:fuse://tests/integration"
              ]

4) Running dbx execute or dbx deploy/launch: dbx will look at pyproject.toml and poetry.lock, gather all dependencies that match the main group or the given dependency groups along with their versions from there, and add them to the API call as libraries.

Now, I'm not sure if 4) is even possible, but it's probably not easy. You'd have to parse the lock file, taking care of environment markers like "sys_platform == \"win32\"" and ignoring those that don't apply, and then generate a list of libraries for the API call. But that's the ideal solution in my mind, with minimal changes to the current dbx development workflow.
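To make 4) a bit more concrete, here is a minimal sketch (not dbx code, and ignoring dependency group selection) of the parsing side, assuming Python 3.11+ for tomllib and the packaging library for marker evaluation:

```python
import tomllib  # Python 3.11+; the tomli package provides the same API for older versions
from packaging.markers import Marker

def lockfile_to_libraries(lock_path="poetry.lock"):
    """Turn every locked package into a Jobs API 'pypi' library entry."""
    with open(lock_path, "rb") as fp:
        lock = tomllib.load(fp)
    return [
        {"pypi": {"package": f'{pkg["name"]}=={pkg["version"]}'}}
        for pkg in lock.get("package", [])
    ]

# Environment markers, like the Windows-only marker on colorama in the lock file above,
# could be evaluated against the target environment with packaging, e.g.:
print(Marker('sys_platform == "win32"').evaluate())  # False on a Linux cluster
```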

pspeter commented 1 year ago

@spott

> From what it sounds like, Databricks doesn't have a method of installing a set of specific package and versions for a job: wheels are provided and then pip is used to resolve the dependencies of those wheels from what is given by the wheels metadata/setup.py.

I don't think that's correct. The Jobs API has the option to provide libraries for each task in tasks. That's for example how dbx deploy --requirements-file requirements.txt works at the moment (even though using --requirements-file is deprecated). This happens here in the code:

https://github.com/databrickslabs/dbx/blob/394a926d0fa45447b2ae5937860bcd549476d957/dbx/api/adjuster/adjuster.py#L68-L88

additional_libraries comes directly from the provided --requirements-file, parsed using pkg_resources by the RequirementsFileProcessor here:

https://github.com/databrickslabs/dbx/blob/main/dbx/commands/deploy.py#L119

@copdips I don't see how that is better than the current solution, where the requirements.txt libraries are added as described above, along with the package wheel itself. To be honest, I just don't understand why --requirements-file was deprecated in the first place. Yes, the file format is not really standardized, but in my opinion the current solution works well enough for "industry standard" use cases. The only gripe I have is that I can't add multiple files (but that's not a big deal) and that I can't specify them in the deployment.yaml for a given task.

copdips commented 1 year ago

Reading the doc, it does give a reason for deprecating it.

AFAIK, from pkg_resources import parse_requirements works very well to resolve requirements.txt.

The current --requirements-file adds the deps explicitly to the libraries key of the final job JSON definition; each dependency is declared as a pypi package. Without this param, it's setup.py that tells Databricks to install them; that's the only difference I can observe.
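Roughly speaking (an illustration of the behaviour described above, not dbx's actual code), the resulting libraries section can be pictured like this:

```python
from pkg_resources import parse_requirements

# Each pinned requirement becomes a pypi library entry in the job definition,
# next to the project wheel itself (the wheel path is just a placeholder).
with open("requirements.txt", encoding="utf-8") as req_file:
    libraries = [{"pypi": {"package": str(req)}} for req in parse_requirements(req_file)]
libraries.append({"whl": "dbfs:/some/versioned/path/project-0.0.1-py3-none-any.whl"})
```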

First of all, if requirements.txt has many packages inside, the libraries key would be big too. Secondly, the application itself is packaged as a wheel file and added to the libraries key too, and its setup.py also has an install_requires part that will install the packages implicitly, which means Databricks will install the same packages twice. The second install should be fast, but it's a little bit strange, or not clean IMHO, to declare the dependencies twice in the job definition.

I don't think it's dbx's job to manage and resolve the dependencies. Maybe Databricks itself could implement a feature to add a requirements-file library type, and dbx deploy would just upload it to dbfs in a versioned way, like the wheel file. Fortunately, most of the popular Python dependency management tools can convert their lock files to a traditional requirements file.

PS: I deleted one of my comments this morning about a dirty and hacky solution. It was about using the same --requirements-file but, instead of adding the requirements explicitly to the libraries part, replacing the install_requires in setup.py with the list resolved by parse_requirements from --requirements-file at dbx deploy time. But I thought it was too hacky, it might need the ast module, and I've never tested it, so I deleted it.

pspeter commented 1 year ago

> which means Databricks will install the same packages twice. The second install should be fast, but it's a little bit strange, or not clean IMHO, to declare the dependencies twice in the job definition.

This is very common actually and not strange at all. pip will just skip the requirements from setup.py since they are installed already, and just install the package itself. I've seen this in many places and done it myself before. How else would you install a package with pinned requirements?

First example I found on google (it's often used in Dockerfiles so I searched for that): https://pythonspeed.com/articles/multi-stage-docker-python/

Edit: Even if you uploaded the requirements.txt file directly, it would still have to install that before installing the library itself. There's nothing gained here. dbx just has to be careful that the requirements.txt dependencies are installed before the package itself, which it currently does just fine afaik.

copdips commented 1 year ago

Yes, what I used previously, before using Databricks, was the requirements.txt file, just like what you showed in the given link.

But by design Databricks doesn't support it directly, and I've never tested the docker solution in Databricks; that's why I raised the above idea of implementing it as a new library type. That way there's no more parsing: Databricks (not dbx) just executes pip install -r /dbfs/path/to/requirements.txt during job startup, before the run.

The gain is that we could bring the --requirements-file param back into dbx deploy (un-deprecate it, precisely), which is what you requested initially. dbx deploy would just add it as a new requirements-file library without any parsing.

spott commented 1 year ago

@pspeter: Oh yeah, I forgot about that. Though under the hood it isn't clear to me what Databricks uses for environment isolation: are they booting up a new venv for each task (and pre-populating it with the DBR python libraries)? Are they polluting the venv for the entire cluster? Or are they using some sort of chroot system for isolation? Hopefully it is a new venv for each task.

For the problem of specifying dependencies when using dbx build, adding a flag to build a dependency-locked wheel from the poetry lockfile might be the easiest. The poetry-lock-package package (https://bneijt.nl/pr/poetry-lock-package/) seems to do exactly that: it takes a poetry project and builds a wheel that has the locked dependencies specified as dependencies of the wheel. (I mentioned this in my last post, but it was a wall of text that I'm sure no one read.)

copdips commented 1 year ago

@spott A Shared access mode cluster is isolated, but users cannot install anything. A No Isolation Shared access mode cluster is not isolated; users can install anything, but everyone will see the changes, because everyone is mapped to root.

AFAIK, the truly isolated environment is a job cluster backed by a cluster pool: every run of the job gets an isolated environment (one or more dedicated nodes) from the pool, and users can install anything they want.

pspeter commented 1 year ago

@copdips That would be an option I suppose; I'm not sure if Databricks wants to change their API for that, though. @renardeinside, what are your thoughts about that option?

@spott I don't know what's happening behind the scenes for dbx execute, I only know they have some kind of "execution_context" that they control with this v1.2 API. I'm pretty sure the libraries are not installed for the whole cluster.

Thanks for linking poetry-lock-package (again). I did read your comment but forgot to properly look at it after starting my own wall of text 😁 It definitely seems worth a closer look, but from a brief look I have some concerns:

copdips commented 1 year ago

@spott, thanks for sharing the blog https://bneijt.nl/pr/poetry-lock-package/. Its way of creating two wheel files (one for the project itself and another for all the deps) looks great, but I'm not sure dbx will use it, as poetry-lock-package is not popular enough... I really like its idea though.

Nevertheless, poetry or not, I would like dbx to keep support for requirements.txt or setup.py at least; this is still the most widely adopted approach in the Python community today.

BTW, regarding the issue of filtering pyspark out of requirements.txt as shown in the blog, I think we can work around it by not filtering anything and instead pinning pyspark to the version preinstalled by the Databricks runtime.

spott commented 1 year ago

poetry-lock-package can actually create a single wheel with both the project and all the deps (that is the default, actually); there is an option for splitting them up if needed, though.

I agree that poetry-lock-package isn't super popular... but it is also only about 350 lines of code, so it might be possible to reproduce it internally at some point if that is decided.

Isn't setup.py already supported by dbx? Building a wheel in python uses either a setup.py file or a pyproject.toml one, so it should work unless dbx has its own wheel-building routine.

The reason that pyspark is filtered is that apparently it isn't installed via pip on Databricks clusters... it is just added to the python path. So if you have pyspark in your requirements.txt or setup.py, it will just install a new version again, but this time one that doesn't work with Databricks' Spark implementation.
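A quick way to check that on a cluster (just a sketch, nothing dbx-specific) is to see whether pyspark is importable without being registered as an installed distribution:

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, distribution

# find_spec() succeeds for anything on the python path; distribution() only finds
# packages that were installed with their metadata (e.g. via pip).
print("importable:", importlib.util.find_spec("pyspark") is not None)
try:
    distribution("pyspark")
    print("pip-installed: yes")
except PackageNotFoundError:
    print("pip-installed: no")
```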

spott commented 1 year ago

@pspeter:

I missed the GPLv3 license; that might make it more difficult.

The whole package is 350 or so lines of code (and it looks like most of it is walking the parsed lockfile tree), so I don't think the task is too difficult. It might even be possible to use some of the internals of poetry-core to save on the parsing and tree walking... though depending on the internals of another project has its own risks.

pspeter commented 1 year ago

To continue the discussion and to add another example of why this is needed:

Today I created a project with dbx that has one dependency. This dependency itself depends on pyspark (any version). I'm using the current dbx-template and databricks runtime 11.3, so pyspark==3.3.0 is added in the local dependency extras. However, due to the transitive dependency, installing the package still includes pyspark in the "normal" dependency tree. So now whenever I dbx execute this project (without the local extra!), it still installs the latest pyspark==3.3.1 on the cluster, even though I want it to use the pre-installed 3.3.0.

I don't see a way around this without using lock files that include all transitive dependencies.

pspeter commented 1 year ago

This issue has come up again for multiple people including us because urllib3 v2.0 was released, which is incompatible with the current version of databricks-cli: https://github.com/databricks/databricks-cli/issues/634

pspeter commented 1 year ago

Hey @renardeinside, do you have any plans to resolve this issue?

I tried the current (deprecated) --requirements-file option again today, and noticed that even that option does not really work. At first dbx does install the dependencies from the file, but after that it does a hard-coded pip install --force-reinstall <package>, which reinstalls all dependencies and "overwrites" the pinned versions with the latest version again 👀

renardeinside commented 1 year ago

Hey @pspeter, I have a plan for this: I'll introduce a new poetry plugin that will generate a .whl with a subset of the dependencies mentioned there. That being said, I cannot fully commit my time to this yet.

The recommended workaround is the following:

  1. for dev dependencies - keep them in a dev extra (or poetry -G dev). Install locally via pip install -e '.[dev]'.
  2. for test dependencies - on all-purpose clusters, install them separately via cluster libraries. On job clusters, provide them into libraries definition.
  3. for main (core) dependencies keep them as default dependencies, they'll be installed by default with %pip on both all-purpose and job clusters.

Unfortunately, with dbx I cannot easily overcome the limitations of existing Python packaging solutions.

pspeter commented 1 year ago

Thanks for your response. I'm looking forward to the plugin!

pspeter commented 1 year ago

Hey @renardeinside, one more question:

What exactly is the reason for the --force-reinstall here? https://github.com/databrickslabs/dbx/blob/2b21c11a4607471aba4957591efd09f752a78404/dbx/api/context.py#L126-L131

Without this option, the current --requirements-file could work well enough for now, but the force-reinstall kind of defeats the purpose of having the requirements file. Any chance to remove that or make it optional? Thanks!

followingell commented 1 year ago

> Hello,
>
> since support for requirements files was deprecated,

Hey @pspeter,

Sorry to come back to this after so much time, but where is it stated that requirements-file support is deprecated?

pspeter commented 1 year ago

Hey @followingell, it says so in the help text of the --requirements-file option and in the CLI reference for dbx execute and dbx deploy; see here: https://dbx.readthedocs.io/en/latest/reference/cli/#dbx-execute

pspeter commented 11 months ago

I found a nice workaround using just setuptools (and pip-compile):

Setuptools nowadays supports dynamic fields that get their contents from a file. Making use of this, create a requirements.in file with your non-pinned dependencies and generate a pinned requirements.txt from it using something like pip-tools. Then, reference the generated requirements.txt via the file attribute:

[project]
name = "<name>"
dynamic = ["dependencies"]
# ...

[tool.setuptools.dynamic]
dependencies = { file = ["requirements.txt"] }

Bonus: if you want to ensure that you get the same versions of all the packages that are pre-installed on a Databricks cluster, you can run pip freeze > constraints.txt on a cluster and add -c constraints.txt to the requirements.in file; pip-compile will then pin all versions in a way that matches those on the cluster. Pyspark won't be included in the pip freeze output though; you have to add it to constraints.txt separately, using something like this:

import pyspark
pyspark_version = ".".join(pyspark.__version__.split(".")[:3])

with open("constraints.txt", "a") as fp:
    fp.write(f"pyspark=={pyspark_version}\n")