dask / dask

Parallel computing with task scheduling
https://dask.org
BSD 3-Clause "New" or "Revised" License
12.59k stars 1.71k forks

Inconsistency in ddf.astype(Arrow Dict) #11078

Closed mscanlon-exos closed 6 months ago

mscanlon-exos commented 6 months ago

Describe the issue: When using an Arrow dictionary type of string indexed by int32, calling astype directly on the dask dataframe coerces the values to pa.large_string, but a dask dataframe created from a pandas dataframe that was already cast correctly keeps the string: int32 type.

Minimal Complete Verifiable Example:

import pandas as pd
import pyarrow as pa
import dask.dataframe as dd

df = pd.DataFrame(data={'cat_col': ['FOO', 'BAR']})
ddf = dd.from_pandas(df)

data_type = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))

# Casting directly on the dask dataframe coerces the values to large_string
new_ddf = ddf.astype(dtypes={'cat_col': data_type})
assert new_ddf.dtypes.cat_col != data_type
assert new_ddf.dtypes.cat_col == pd.ArrowDtype(pa.dictionary(pa.int32(), pa.large_string()))

# Casting on the pandas dataframe keeps pa.string as requested
new_df = df.astype(dtype={'cat_col': data_type})
assert new_df.dtypes.cat_col == data_type

# ...and the dtype survives the conversion to a dask dataframe
new_ddf_from_pandas_coerced = dd.from_pandas(new_df)
assert new_ddf_from_pandas_coerced.dtypes.cat_col == data_type

Anything else we need to know?: I would expect the behavior to be consistent either way (string or large_string). My recommendation would be to keep it as pa.string, since that is the pandas behavior, but maybe there is a reason this was changed?

Environment:

phofl commented 6 months ago

Thanks for your report. This is actually a pandas bug; see https://github.com/pandas-dev/pandas/pull/58479 for a fix.

Closing here since we're waiting for the fix upstream.