Open jakirkham opened 1 year ago
Please fill free to edit and fill this issue out Marvin 🙂
Just needed a placeholder for tracking
Would also be good to make a note in issue (where feedback is being collected): https://github.com/dask/dask/issues/10139
Ideally with a simple reproducer
Despite the passing tests, potentially users who installed dask-image over conda would still experience the above described problem when using ndmeasure.find_objects
(need to check whether there's more affected).
Perhaps a suitable fix for this on the dask-image
side would be to set dask.config.set({"dataframe.convert-string": False})
using a context manager around the affected functionality? See this recommendation.
Edit:
Confirmed that using with dask_config.set({'dataframe.convert-string': False}):
around
https://github.com/dask/dask-image/blob/67540af25597f84e4a642d644ba30dce7aebe753/dask_image/ndmeasure/__init__.py#L243-L247
and
https://github.com/dask/dask-image/blob/67540af25597f84e4a642d644ba30dce7aebe753/dask_image/ndmeasure/_utils/_find_objects.py#L68-L74
fixes the errors.
Would also be good to make a note in issue (where feedback is being collected): dask/dask#10139
Ideally with a simple reproducer
I've made a comment here, but no reproducer (I'm not planning to do more work on this, it's open for anyone who wants it)
Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version
Tbh I've not looked deeply into the Dask Arrow work. Have heard about it mainly in passing. So not sure how dask.config
handles it
Should add this pain point is not unique to us. We had to disable this feature in Dask-SQL recently as well ( https://github.com/dask-contrib/dask-sql/pull/1206 ). Unclear whether this is due to upstream bugs or if we need to make changes
Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version
We could add an "upstream" CI environment, that just uses whatever the latest (or even pre-release?) versions are, maybe?
There are Dask nightly packages. So that would be easy to add
As far as I understand, for reproducing the test failure in CI, next to a recent dask version we'd need arrow>=7
present in the environment. This is a regular dependency of the conda , but not the pypi dask distribution.
Is this still an issue with recent Dask releases? Asking as they may have fixed something upstream since this occurred
I'm not seeing any flaky/failing tests, so I don't think this is still happening currently. I'll close the issue, and if it pops up again we can re-open.
Yeah I think part of the issue before was CI doesn't capture this edge case. Though maybe it should
Reopening as this issue just came up in https://github.com/dask/dask-image/issues/355.
Think this may have been fixed in the intervening time. The test suite no longer fails for me
(Edited by @m-albert)
In the presence of
pyarrow
, dask by default assumes dataframes of type object to be pyarrow strings (see https://github.com/dask/dask/issues/10139#issuecomment-1655928619).This creates problems revealed by failing tests (e.g.
test_dask_image/test_ndmeasure/test_find_objects.py::test_3d_find_objects
)https://github.com/dask/dask-image/blob/67540af25597f84e4a642d644ba30dce7aebe753/dask_image/ndmeasure/_utils/_find_objects.py#L68-L70
dd.from_delayed(df1, meta=meta).compute().dtypes
Working install:
Failing install:
The failing test had come up when releasing v2023.08.0 in https://github.com/conda-forge/dask-image-feedstock/pull/14.
@jakirkham found that
pyarrow
is installed with the conda distribution of dask, but not when installing over pip, where it just part of the[complete]
target.Also @jakirkham found that the above described conflicting behaviour can be turned off using the dask configuration.
He did this for the tests performed by the dask-image conda feedstock on v2023.08.0.