dask / dask-image

Distributed image processing
http://image.dask.org/en/latest/
BSD 3-Clause "New" or "Revised" License
210 stars 47 forks source link

New Dask Arrow-based strings cause test failures #335

Open jakirkham opened 1 year ago

jakirkham commented 1 year ago

(Edited by @m-albert)

In the presence of pyarrow, dask by default assumes dataframes of type object to be pyarrow strings (see https://github.com/dask/dask/issues/10139#issuecomment-1655928619).

This creates problems revealed by failing tests (e.g. test_dask_image/test_ndmeasure/test_find_objects.py::test_3d_find_objects)

https://github.com/dask/dask-image/blob/67540af25597f84e4a642d644ba30dce7aebe753/dask_image/ndmeasure/_utils/_find_objects.py#L68-L70

dd.from_delayed(df1, meta=meta).compute().dtypes

Working install:

0 object 1 object 2 object dtype: object

Failing install:

0 string[pyarrow] 1 string[pyarrow] 2 string[pyarrow] dtype: object

The failing test had come up when releasing v2023.08.0 in https://github.com/conda-forge/dask-image-feedstock/pull/14.

@jakirkham found that pyarrow is installed with the conda distribution of dask, but not when installing over pip, where it just part of the [complete] target.

Also @jakirkham found that the above described conflicting behaviour can be turned off using the dask configuration.

He did this for the tests performed by the dask-image conda feedstock on v2023.08.0.

jakirkham commented 1 year ago

Please fill free to edit and fill this issue out Marvin 🙂

Just needed a placeholder for tracking

jakirkham commented 1 year ago

Would also be good to make a note in issue (where feedback is being collected): https://github.com/dask/dask/issues/10139

Ideally with a simple reproducer

m-albert commented 1 year ago

Despite the passing tests, potentially users who installed dask-image over conda would still experience the above described problem when using ndmeasure.find_objects (need to check whether there's more affected).

Perhaps a suitable fix for this on the dask-image side would be to set dask.config.set({"dataframe.convert-string": False}) using a context manager around the affected functionality? See this recommendation.

Edit: Confirmed that using with dask_config.set({'dataframe.convert-string': False}): around https://github.com/dask/dask-image/blob/67540af25597f84e4a642d644ba30dce7aebe753/dask_image/ndmeasure/__init__.py#L243-L247 and https://github.com/dask/dask-image/blob/67540af25597f84e4a642d644ba30dce7aebe753/dask_image/ndmeasure/_utils/_find_objects.py#L68-L74 fixes the errors.

GenevieveBuckley commented 1 year ago

Would also be good to make a note in issue (where feedback is being collected): dask/dask#10139

Ideally with a simple reproducer

I've made a comment here, but no reproducer (I'm not planning to do more work on this, it's open for anyone who wants it)

jakirkham commented 1 year ago

Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version

Tbh I've not looked deeply into the Dask Arrow work. Have heard about it mainly in passing. So not sure how dask.config handles it

Should add this pain point is not unique to us. We had to disable this feature in Dask-SQL recently as well ( https://github.com/dask-contrib/dask-sql/pull/1206 ). Unclear whether this is due to upstream bugs or if we need to make changes

GenevieveBuckley commented 1 year ago

Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version

We could add an "upstream" CI environment, that just uses whatever the latest (or even pre-release?) versions are, maybe?

jakirkham commented 1 year ago

There are Dask nightly packages. So that would be easy to add

m-albert commented 1 year ago

As far as I understand, for reproducing the test failure in CI, next to a recent dask version we'd need arrow>=7 present in the environment. This is a regular dependency of the conda , but not the pypi dask distribution.

jakirkham commented 8 months ago

Is this still an issue with recent Dask releases? Asking as they may have fixed something upstream since this occurred

GenevieveBuckley commented 8 months ago

I'm not seeing any flaky/failing tests, so I don't think this is still happening currently. I'll close the issue, and if it pops up again we can re-open.

jakirkham commented 8 months ago

Yeah I think part of the issue before was CI doesn't capture this edge case. Though maybe it should

m-albert commented 8 months ago

Reopening as this issue just came up in https://github.com/dask/dask-image/issues/355.

jakirkham commented 5 months ago

Think this may have been fixed in the intervening time. The test suite no longer fails for me