dask / dask-expr

Shuffling with categorical data raises `AttributeError: 'ArrowStringArray' object has no attribute 'categories'` #1056

Open hendrikmakait opened 1 month ago

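Describe the issue:

Calling `shuffle` on a DataFrame that has a categorical column raises `AttributeError: 'ArrowStringArray' object has no attribute 'categories'` on `compute()`.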

Minimal Complete Verifiable Example:

import dask.dataframe as dd
df = dd.from_dict(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [
            "x",
            "y",
            "x",
            "y",
            "z",
        ],
    },
    npartitions=2,
)
df.b = df.b.astype("category")
res = df.shuffle("a").compute()

This raises:

Traceback (most recent call last):
  File "/Users/hendrikmakait/projects/dask/dask-expr/reproducer.py", line 16, in <module>
    res = df.shuffle("a").compute()
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hendrikmakait/projects/dask/dask-expr/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/dataframe/dispatch.py", line 68, in concat
    return func(
           ^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/dataframe/backends.py", line 676, in concat_pandas
    out[col] = union_categoricals(parts, ignore_order=ignore_order)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/pandas/core/dtypes/concat.py", line 304, in union_categoricals
    if not lib.dtypes_all_equal([obj.categories.dtype for obj in to_union]):
                                 ^^^^^^^^^^^^^^
AttributeError: 'ArrowStringArray' object has no attribute 'categories'

FWIW, it doesn't matter whether I shuffle on `a` or `b`.
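
The last two frames of the traceback show `union_categoricals` reaching for `.categories` on an `ArrowStringArray`, so at least one of the pieces passed to `concat_pandas` was not categorical at that point. Why that piece lost its categorical dtype is not established here. The following pandas-only sketch just illustrates the failure mode, assuming one part is categorical and another holds plain strings; the names `cat_part` and `str_part` are hypothetical, and `dtype="string"` stands in for the pyarrow-backed strings in the traceback so that pyarrow is not required.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical part that kept its categorical dtype.
cat_part = pd.Series(["x", "y", "x"], dtype="category")

# Hypothetical part that holds plain strings instead of categoricals.
# (Assumption: this mimics the ArrowStringArray seen in the traceback.)
str_part = pd.Series(["y", "z"], dtype="string")

error = None
try:
    # union_categoricals unwraps each Series to its underlying array and
    # then reads `.categories`, which a string array does not have.
    union_categoricals([cat_part, str_part])
except AttributeError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

If this mirrors what happens inside the shuffle, the mismatch would have to be introduced before `concat` is called, for example while the partitions are split or reassembled.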