Open mrocklin opened 9 months ago
Our tests are failing since the env doesn't contain dask-expr
../../../micromamba/envs/xarray-tests/lib/python3.12/site-packages/dask/dataframe/__init__.py:22: in _dask_expr_enabled
import dask_expr # noqa: F401
E ModuleNotFoundError: No module named 'dask_expr'
During handling of the above exception, another exception occurred:
xarray/tests/test_dask.py:33: in <module>
dd = pytest.importorskip("dask.dataframe")
../../../micromamba/envs/xarray-tests/lib/python3.12/site-packages/dask/dataframe/__init__.py:87: in <module>
if _dask_expr_enabled():
../../../micromamba/envs/xarray-tests/lib/python3.12/site-packages/dask/dataframe/__init__.py:24: in _dask_expr_enabled
raise ValueError("Must install dask-expr to activate query planning.")
E ValueError: Must install dask-expr to activate query planning.
It looks like the _dask_expr_enabled()
function doesn't catch an ImportError
I wonder if, for the short term, we should make this a warning and fall back to the legacy implementation.
cc @fjetter @phofl
Without that you'd have to make dask-expr
a required dependency (which would be fine by us :) )
We have actually. You need to pip install dask[dataframe]
or pip install dask[complete]
. My guess is that xarray is depending only on dask[array]
which doesn't even strictly require pandas.
dask-expr
is a hard dependency for dask.dataframe
. It looks like you're installing dask-core
in CI (doesn't have dependencies like numpy
/ pandas
/ etc). Does changing that to dask
instead of dask-core
work for Xarray? That would pull in dask-expr
automatically
I wonder if, for the short term, we should make this a warning and fall back to the legacy implementation.
I don't think that this is a good idea if we want people to actually use it
We would warn and then gracefully fall back to old behavior.
That seems better to me than failing hard.
People mostly ignored the deprecation warning even though it was very noisy, I don't think that this would be any different.
Ah yes, we missed adding dask-expr
to one of our environments. (we are installing dask-core
from conda)
People mostly ignored the deprecation warning even though it was very noisy, I don't think that this would be any different.
Well, like in that case it was pretty disruptive and people were sad. We've learned that people don't like being disrupted. I'm inclined to give them the warning information for now but also gracefully fall back. I don't want to force people's code to break in order to get them to upgrade; at least not yet.
We had another subtle issue from dask-expr
setting a pandas global option mode.copy_on_write
: https://github.com/pydata/xarray/pull/8829#issuecomment-1996044029
We had another subtle issue from
dask-expr
setting a pandas global optionmode.copy_on_write
: pydata/xarray#8829 (comment)
This is a known issue https://github.com/dask/dask/issues/10996
More failures :) See https://github.com/pydata/xarray/actions/runs/8288508222/job/22683171250?pr=8790
_____________________ TestDataArray.test_to_dask_dataframe _____________________
[gw1] linux -- Python 3.9.18 /home/runner/micromamba/envs/xarray-tests/bin/python
self = (Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
key = 'values'
def __getattr__(self, key):
try:
> return object.__getattribute__(self, key)
E AttributeError: 'Projection' object has no attribute 'values'
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_core.py:446: AttributeError
During handling of the above exception, another exception occurred:
self = Dask Series Structure:
npartitions=1
0 int64
11 ...
Dask Name: getitem, 5 expressions
Expr=(Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
key = 'values'
def __getattr__(self, key):
try:
# Prioritize `FrameBase` attributes
return object.__getattribute__(self, key)
except AttributeError as err:
try:
# Fall back to `expr` API
# (Making sure to convert to/from Expr)
> val = getattr(self.expr, key)
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:511:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = (Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
key = 'values'
def __getattr__(self, key):
try:
return object.__getattribute__(self, key)
except AttributeError as err:
if key.startswith("_meta"):
# Avoid a recursive loop if/when `self._meta*`
# produces an `AttributeError`
raise RuntimeError(
f"Failed to generate metadata for {self}. "
"This operation may not be supported by the current backend."
)
# Allow operands to be accessed as attributes
# as long as the keys are not already reserved
# by existing methods/properties
_parameters = type(self)._parameters
if key in _parameters:
idx = _parameters.index(key)
return self.operands[idx]
if is_dataframe_like(self._meta) and key in self._meta.columns:
return self[key]
link = "https://github.com/dask-contrib/dask-expr/blob/main/README.md#api-coverage"
> raise AttributeError(
f"{err}\n\n"
"This often means that you are attempting to use an unsupported "
f"API function. Current API coverage is documented here: {link}."
)
E AttributeError: 'Projection' object has no attribute 'values'
E
E This often means that you are attempting to use an unsupported API function. Current API coverage is documented here: https://github.com/dask-contrib/dask-expr/blob/main/README.md#api-coverage.
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_core.py:467: AttributeError
During handling of the above exception, another exception occurred:
self = <xarray.tests.test_dataarray.TestDataArray object at 0x7187713d36a0>
@requires_dask_expr
@requires_dask
def test_to_dask_dataframe(self) -> None:
arr_np = np.arange(3 * 4).reshape(3, 4)
arr = DataArray(arr_np, [("B", [1, 2, 3]), ("A", list("cdef"))], name="foo")
expected = arr.to_series()
actual = arr.to_dask_dataframe()["foo"]
> assert_array_equal(actual.values, expected.values)
/home/runner/work/xarray/xarray/xarray/tests/test_dataarray.py:3429:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:517: in __getattr__
raise err
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:506: in __getattr__
return object.__getattribute__(self, key)
/home/runner/micromamba/envs/xarray-tests/lib/python3.9/site-packages/dask_expr/_collection.py:1295: in values
return self.to_dask_array()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Dask Series Structure:
npartitions=1
0 int64
11 ...
Dask Name: getitem, 5 expressions
Expr=(Concat(frames=[FromGraph(58f3d2e), FromGraph(84f3e3e), FromGraph(b57bdfe)], axis=1))['foo']
lengths = None, meta = None, optimize = True, optimize_kwargs = {}
def to_dask_array(
self, lengths=None, meta=None, optimize: bool = True, **optimize_kwargs
) -> Array:
"""Convert a dask DataFrame to a dask array.
Parameters
----------
lengths : bool or Sequence of ints, optional
How to determine the chunks sizes for the output array.
By default, the output array will have unknown chunk lengths
along the first axis, which can cause some later operations
to fail.
* True : immediately compute the length of each partition
* Sequence : a sequence of integers to use for the chunk sizes
on the first axis. These values are *not* validated for
correctness, beyond ensuring that the number of items
matches the number of partitions.
meta : object, optional
An optional `meta` parameter can be passed for dask to override the
default metadata on the underlying dask array.
optimize : bool
Whether to optimize the expression before converting to an Array.
Returns
-------
A Dask Array
"""
> return self.to_dask_dataframe(optimize, **optimize_kwargs).to_dask_array(
lengths=lengths, meta=meta
)
E AttributeError: 'Scalar' object has no attribute 'to_dask_array'
@dcherian put up a pr to fix this test: https://github.com/dask/dask-expr/pull/981
We'd like to make sure that Xarray works with the new version of dask dataframe
I took a quick look at the codebase, and it looks like they're using
dd.from_dask_array
, so everything should be ok. It might be worth checking with them explicitly though. cc @dcherian