dask / dask-expr

BSD 3-Clause "New" or "Revised" License
86 stars 27 forks source link

Check with Xarray #904

Open mrocklin opened 9 months ago

mrocklin commented 9 months ago

We'd like to make sure that Xarray works with the new version of dask dataframe

I took a quick look at the codebase, and it looks like they're using dd.from_dask_array, so everything should be ok. It might be worth checking with them explicitly though. cc @dcherian

dcherian commented 8 months ago

Our tests are failing since the env doesn't contain dask-expr

 ../../../micromamba/envs/xarray-tests/lib/python3.12/site-packages/dask/dataframe/__init__.py:22: in _dask_expr_enabled
    import dask_expr  # noqa: F401
E   ModuleNotFoundError: No module named 'dask_expr'

During handling of the above exception, another exception occurred:
xarray/tests/test_dask.py:33: in <module>
    dd = pytest.importorskip("dask.dataframe")
../../../micromamba/envs/xarray-tests/lib/python3.12/site-packages/dask/dataframe/__init__.py:87: in <module>
    if _dask_expr_enabled():
../../../micromamba/envs/xarray-tests/lib/python3.12/site-packages/dask/dataframe/__init__.py:24: in _dask_expr_enabled
    raise ValueError("Must install dask-expr to activate query planning.")
E   ValueError: Must install dask-expr to activate query planning.

It looks like the _dask_expr_enabled() function doesn't catch an ImportError

mrocklin commented 8 months ago

I wonder if, for the short term, we should make this a warning and fall back to the legacy implementation.

cc @fjetter @phofl

dcherian commented 8 months ago

Without that you'd have to make dask-expr a required dependency (which would be fine by us :) )

mrocklin commented 8 months ago

We have actually. You need to pip install dask[dataframe] or pip install dask[complete]. My guess is that xarray is depending only on dask[array] which doesn't even strictly require pandas.

jrbourbeau commented 8 months ago

dask-expr is a hard dependency for dask.dataframe. It looks like you're installing dask-core in CI (doesn't have dependencies like numpy / pandas / etc). Does changing that to dask instead of dask-core work for Xarray? That would pull in dask-expr automatically

phofl commented 8 months ago

I wonder if, for the short term, we should make this a warning and fall back to the legacy implementation.

I don't think that this is a good idea if we want people to actually use it

mrocklin commented 8 months ago

We would warn and then gracefully fall back to old behavior.

That seems better to me than failing hard.

phofl commented 8 months ago

People mostly ignored the deprecation warning even though it was very noisy, I don't think that this would be any different.

dcherian commented 8 months ago

Ah yes, we missed adding dask-expr to one of our environments. (we are installing dask-core from conda)

mrocklin commented 8 months ago

People mostly ignored the deprecation warning even though it was very noisy, I don't think that this would be any different.

Well, like in that case it was pretty disruptive and people were sad. We've learned that people don't like being disrupted. I'm inclined to give them the warning information for now but also gracefully fall back. I don't want to force people's code to break in order to get them to upgrade; at least not yet.

dcherian commented 8 months ago

We had another subtle issue from dask-expr setting a pandas global option mode.copy_on_write: https://github.com/pydata/xarray/pull/8829#issuecomment-1996044029

fjetter commented 8 months ago

We had another subtle issue from dask-expr setting a pandas global option mode.copy_on_write: pydata/xarray#8829 (comment)

This is a known issue https://github.com/dask/dask/issues/10996

dcherian commented 8 months ago

More failures :) See https://github.com/pydata/xarray/actions/runs/8288508222/job/22683171250?pr=8790

_____________________ TestDataArray.test_to_dask_dataframe _____________________
[gw1] linux -- Python 3.9.18 /home/runner/micromamba/envs/xarray-tests/bin/python

self = (Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
key = 'values'

    def __getattr__(self, key):
        try:
>           return object.__getattribute__(self, key)
E           AttributeError: 'Projection' object has no attribute 'values'

/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_core.py:446: AttributeError

During handling of the above exception, another exception occurred:

self = Dask Series Structure:
npartitions=1
0     int64
11      ...
Dask Name: getitem, 5 expressions
Expr=(Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
key = 'values'

    def __getattr__(self, key):
        try:
            # Prioritize `FrameBase` attributes
            return object.__getattribute__(self, key)
        except AttributeError as err:
            try:
                # Fall back to `expr` API
                # (Making sure to convert to/from Expr)
>               val = getattr(self.expr, key)

/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:511: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = (Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
key = 'values'

    def __getattr__(self, key):
        try:
            return object.__getattribute__(self, key)
        except AttributeError as err:
            if key.startswith("_meta"):
                # Avoid a recursive loop if/when `self._meta*`
                # produces an `AttributeError`
                raise RuntimeError(
                    f"Failed to generate metadata for {self}. "
                    "This operation may not be supported by the current backend."
                )

            # Allow operands to be accessed as attributes
            # as long as the keys are not already reserved
            # by existing methods/properties
            _parameters = type(self)._parameters
            if key in _parameters:
                idx = _parameters.index(key)
                return self.operands[idx]
            if is_dataframe_like(self._meta) and key in self._meta.columns:
                return self[key]

            link = "https://github.com/dask-contrib/dask-expr/blob/main/README.md#api-coverage"
>           raise AttributeError(
                f"{err}\n\n"
                "This often means that you are attempting to use an unsupported "
                f"API function. Current API coverage is documented here: {link}."
            )
E           AttributeError: 'Projection' object has no attribute 'values'
E           
E           This often means that you are attempting to use an unsupported API function. Current API coverage is documented here: https://github.com/dask-contrib/dask-expr/blob/main/README.md#api-coverage.

/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_core.py:467: AttributeError

During handling of the above exception, another exception occurred:

self = <xarray.tests.test_dataarray.TestDataArray object at 0x7187713d36a0>

    @requires_dask_expr
    @requires_dask
    def test_to_dask_dataframe(self) -> None:
        arr_np = np.arange(3 * 4).reshape(3, 4)
        arr = DataArray(arr_np, [("B", [1, 2, 3]), ("A", list("cdef"))], name="foo")
        expected = arr.to_series()
        actual = arr.to_dask_dataframe()["foo"]

>       assert_array_equal(actual.values, expected.values)

/home/runner/work/xarray/xarray/xarray/tests/test_dataarray.py:3429: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:517: in __getattr__
    raise err
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:506: in __getattr__
    return object.__getattribute__(self, key)
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:1295: in values
    return self.to_dask_array()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Dask Series Structure:
npartitions=1
0     int64
11      ...
Dask Name: getitem, 5 expressions
Expr=(Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
lengths = None, meta = None, optimize = True, optimize_kwargs = {}

    def to_dask_array(
        self, lengths=None, meta=None, optimize: bool = True, **optimize_kwargs
    ) -> Array:
        """Convert a dask DataFrame to a dask array.

        Parameters
        ----------
        lengths : bool or Sequence of ints, optional
            How to determine the chunks sizes for the output array.
            By default, the output array will have unknown chunk lengths
            along the first axis, which can cause some later operations
            to fail.

            * True : immediately compute the length of each partition
            * Sequence : a sequence of integers to use for the chunk sizes
              on the first axis. These values are *not* validated for
              correctness, beyond ensuring that the number of items
              matches the number of partitions.
        meta : object, optional
            An optional `meta` parameter can be passed for dask to override the
            default metadata on the underlying dask array.
        optimize : bool
            Whether to optimize the expression before converting to an Array.

        Returns
        -------
        A Dask Array
        """
>       return self.to_dask_dataframe(optimize, **optimize_kwargs).to_dask_array(
            lengths=lengths, meta=meta
        )
E       AttributeError: 'Scalar' object has no attribute 'to_dask_array'
phofl commented 8 months ago

@dcherian put up a pr to fix this test: https://github.com/dask/dask-expr/pull/981