dask / dask-expr

BSD 3-Clause "New" or "Revised" License
79 stars 18 forks source link

Incomplete list of dependencies #1079

Closed penguinpee closed 2 weeks ago

penguinpee commented 2 weeks ago

Describe the issue:

The list of packages dask-expr depnds on is incomplete. It requires distributed and crick. Neither of which is listed as a dependency, nor is it a dependency of a dependency. For dask it's an optional dependency and part of dask[distributed].

Minimal Complete Verifiable Example:

n/a

Anything else we need to know?:

This came up during packaging for Fedora, where we use the available packages from the repositories and rely on the package's specification of dependencies. Running the tests makes clear distributed is required. Though, the tests can be excluded, it is imported in the code as well and not guarded.

By way of smoke test, we also import all modules of a package. That brought to light the dependency on crick.

Environment:

phofl commented 2 weeks ago

Can you elaborate a little bit here?

neither crick nor distributed are required dependencies to run the Dataframe API. Both are required for some parts, but this hasn't changed.

We didn't properly escape tests, but I'll put up a PR shortly for this

phofl commented 2 weeks ago

closed by #1081

penguinpee commented 2 weeks ago

neither crick nor distributed are required dependencies to run the Dataframe API. Both are required for some parts, but this hasn't changed.

If they are required for some part, but not listed as (optional) dependency, touching that part will throw a ModuleNotFoundError. That should not happen.

Output when running `pytest` without `distributed` (in a venv) ``` ==================================================================================================================================== ERRORS ==================================================================================================================================== _____________________________________________________________________________________________________________ ERROR collecting dask_expr/io/tests/test_parquet.py ______________________________________________________________________________________________________________ ImportError while importing test module '/home/sandro/devel/dask-expr/dask_expr/io/tests/test_parquet.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: /usr/lib/python3.12/site-packages/_pytest/python.py:617: in _importtestmodule mod = import_path(self.path, mode=importmode, root=self.config.rootpath) /usr/lib/python3.12/site-packages/_pytest/pathlib.py:564: in import_path importlib.import_module(module_name) /usr/lib64/python3.12/importlib/__init__.py:90: in import_module return _bootstrap._gcd_import(name[level:], package, level) :1387: in _gcd_import ??? :1360: in _find_and_load ??? :1331: in _find_and_load_unlocked ??? :935: in _load_unlocked ??? /usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module exec(co, module.__dict__) dask_expr/io/tests/test_parquet.py:10: in from distributed.utils_test import gen_cluster E ModuleNotFoundError: No module named 'distributed' _____________________________________________________________________________________________________________ ERROR collecting dask_expr/tests/test_diagnostics.py _____________________________________________________________________________________________________________ ImportError while importing test module '/home/sandro/devel/dask-expr/dask_expr/tests/test_diagnostics.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: /usr/lib/python3.12/site-packages/_pytest/python.py:617: in _importtestmodule mod = import_path(self.path, mode=importmode, root=self.config.rootpath) /usr/lib/python3.12/site-packages/_pytest/pathlib.py:564: in import_path importlib.import_module(module_name) /usr/lib64/python3.12/importlib/__init__.py:90: in import_module return _bootstrap._gcd_import(name[level:], package, level) :1387: in _gcd_import ??? :1360: in _find_and_load ??? :1331: in _find_and_load_unlocked ??? :935: in _load_unlocked ??? /usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module exec(co, module.__dict__) dask_expr/tests/test_diagnostics.py:6: in from distributed.utils_test import * # noqa E ModuleNotFoundError: No module named 'distributed' =========================================================================================================================== short test summary info ============================================================================================================================ SKIPPED [1] dask_expr/io/tests/test_distributed.py:7: could not import 'distributed': No module named 'distributed' SKIPPED [1] dask_expr/tests/test_distributed.py:10: could not import 'distributed': No module named 'distributed' ERROR dask_expr/io/tests/test_parquet.py ERROR dask_expr/tests/test_diagnostics.py !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ========================================================================================================================= 2 skipped, 2 errors in 2.34s ========================================================================================================================= ```

Regarding crick, importing diagnostics fails with ModuleNotFoundError. If it's truly optional, the import should not fail.

Error when importing `diagnostics` ```python Check import: dask_expr.diagnostics Traceback (most recent call last): File "/usr/lib/rpm/redhat/import_all_modules.py", line 171, in main() ~~~~^^ File "/usr/lib/rpm/redhat/import_all_modules.py", line 167, in main import_modules(modules) ~~~~~~~~~~~~~~^^^^^^^^^ File "/usr/lib/rpm/redhat/import_all_modules.py", line 100, in import_modules importlib.import_module(module) ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^ File "/usr/lib64/python3.13/importlib/__init__.py", line 88, in import_module return _bootstrap._gcd_import(name[level:], package, level) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "", line 1387, in _gcd_import File "", line 1360, in _find_and_load File "", line 1331, in _find_and_load_unlocked File "", line 935, in _load_unlocked File "", line 1022, in exec_module File "", line 488, in _call_with_frames_removed File "/builddir/build/BUILD/python-dask-expr-1.1.3-build/BUILDROOT/usr/lib/python3.13/site-packages/dask_expr/diagnostics/__init__.py", line 1, in from dask_expr.diagnostics._analyze import analyze File "/builddir/build/BUILD/python-dask-expr-1.1.3-build/BUILDROOT/usr/lib/python3.13/site-packages/dask_expr/diagnostics/_analyze.py", line 11, in from dask_expr.diagnostics._analyze_plugin import ( ...<4 lines>... ) File "/builddir/build/BUILD/python-dask-expr-1.1.3-build/BUILDROOT/usr/lib/python3.13/site-packages/dask_expr/diagnostics/_analyze_plugin.py", line 4, in from crick import TDigest ModuleNotFoundError: No module named 'crick' ``` Thanks for the patch. I'll test drive it shortly.
phofl commented 2 weeks ago

I listed them as optional in the PR

penguinpee commented 2 weeks ago

Thanks! I still see one test failing:

FAILED dask_expr/tests/test_shuffle.py::test_respect_context_shuffle[shuffle] - ModuleNotFoundError: No module named 'distributed'

With crick present, importing diagnostics fails on distributed. So, how optional are those? Is diagnostics only to be used internally or in special cases?

By the way, we still intend to package crick and distributed as well. It just came as a surprise that those were needed, or, put another way, that intervention was needed to get it to pass testing.

phofl commented 2 weeks ago

They are only used if you want to analyze your query and not recommended for anything that runs in production. It's just an interactive tool that isn't needed for actually running stuff

On Wed, Jun 19, 2024 at 6:13 PM Sandro @.***> wrote:

Thanks! I still see one test failing:

FAILED dask_expr/tests/test_shuffle.py::test_respect_context_shuffle[shuffle] - ModuleNotFoundError: No module named 'distributed'

With crick present, importing diagnostics fails on distributed. So, how optional are those? Is diagnostics only to be used internally or in special cases?

By the way, we still intend to package crick and distributed as well. It just came as a surprise that those were needed, or, put another way, that intervention was needed to get it to pass testing.

— Reply to this email directly, view it on GitHub https://github.com/dask/dask-expr/issues/1079#issuecomment-2179060821, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOYQZGCIEAS57LO3NIXVAYLZIGU3FAVCNFSM6AAAAABJQ2KZ7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGA3DAOBSGE . You are receiving this because you modified the open/close state.Message ID: @.***>

penguinpee commented 2 weeks ago

We should be okay shipping without those then for the time being. Thanks for the clarification.

phofl commented 2 weeks ago

FWIW you should definitely have distributed for performance and all kinds of deployment reasons, crick is not very important

On Wed, Jun 19, 2024 at 6:29 PM Sandro @.***> wrote:

We should be okay shipping without those then for the time being. Thanks for the clarification.

— Reply to this email directly, view it on GitHub https://github.com/dask/dask-expr/issues/1079#issuecomment-2179090807, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOYQZGF4NQVKSOF3HRWA3QDZIGWXJAVCNFSM6AAAAABJQ2KZ7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGA4TAOBQG4 . You are receiving this because you modified the open/close state.Message ID: @.***>

penguinpee commented 2 weeks ago

It's being worked on. It will come to Fedora, eventually. I packaged crick myself. It's awaiting review now.