apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Pyarrow conversion from and to pandas fails for categorical variables with large dictionaries #44048

Open lukaswenzl-akur8 opened 1 week ago

lukaswenzl-akur8 commented 1 week ago

Describe the bug, including details regarding any error messages, version, and platform.

Converting from pandas to pyarrow with Table.from_pandas for dataframes with categorical columns with large dictionaries fails. Similarly loading such a column from a parquet file and converting to pandas with Table.to_pandas() fails.

The failure happens when the total number of characters in the categories exceeds the maximum value of a signed 32-bit integer (np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647), indicating it may be an int32 overflow issue.
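
As a quick check, this condition can be wrapped in a small guard (a minimal sketch; the helper name is ours, not part of any API):

import pandas as pd

INT32_MAX = 2_147_483_647  # maximum value of a signed 32-bit integer

def exceeds_string_capacity(col: pd.Series) -> bool:
    # Total character count of the category labels; above INT32_MAX the
    # categories no longer fit into a single int32-offset string array.
    return int(col.cat.categories.str.len().sum()) > INT32_MAX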

Below is example code that reproduces the failure:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> pd.__version__
'2.2.2'
>>> import numpy as np
>>> pa.__version__
'17.0.0'
>>> n_rows = 120_000_000
>>> df = pd.DataFrame()
>>> df["float_gran"] = np.random.rand(n_rows)
>>> df["float_gran"] = df["float_gran"].astype(str).astype("category")
>>> pa.Table.from_pandas(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 pa.Table.from_pandas(df)

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:4623, in pyarrow.lib.Table.from_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    611     return (isinstance(arr, np.ndarray) and
    612             arr.flags.contiguous and
    613             issubclass(arr.dtype.type, np.integer))
    615 if nthreads == 1:
--> 616     arrays = [convert_column(c, f)
    617               for c, f in zip(columns_to_convert, convert_fields)]
    618 else:
    619     arrays = []

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in <listcomp>(.0)
    611     return (isinstance(arr, np.ndarray) and
    612             arr.flags.contiguous and
    613             issubclass(arr.dtype.type, np.integer))
    615 if nthreads == 1:
--> 616     arrays = [convert_column(c, f)
    617               for c, f in zip(columns_to_convert, convert_fields)]
    618 else:
    619     arrays = []

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:597, in dataframe_to_arrays.<locals>.convert_column(col, field)
    594     type_ = field.type
    596 try:
--> 597     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    598 except (pa.ArrowInvalid,
    599         pa.ArrowNotImplementedError,
    600         pa.ArrowTypeError) as e:
    601     e.args += ("Conversion failed for column {!s} with type {!s}"
    602                .format(col.name, col.dtype),)

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:346, in pyarrow.lib.array()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:3863, in pyarrow.lib.DictionaryArray.from_arrays()

TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

Note: the same error message was reported in issue #41936, but that discussion was about RecordBatch, and it was noted there that Table.from_pandas (used here) should work fine.

>>> from pyarrow.interchange import from_dataframe
>>> table = from_dataframe(df)
>>> table
pyarrow.Table
float_gran: dictionary<values=large_string, indices=int32, ordered=0>
----
float_gran: [  -- dictionary:
["0.00010000625394479545","0.00010000637024687453","0.00010002156605048995","0.00010002375983830802","0.00010003348618559116",...,"9.996332028860966e-05","9.996352231378403e-05","9.996744926299428e-06","9.99769273748452e-05","9.99829165827526e-05"]  -- indices:
[72271820,16482433,4156153,49996213,77690435,...,24623248,57247299,27016212,102115156,112204811]]
>>> table.to_pandas()
>>> # works!
>>> pq.write_table(table, "test", use_dictionary=False)
>>> table_loaded = pq.read_table("test")
>>> table_loaded.to_pandas()
---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
Cell In[18], line 1
----> 1 table_loaded.to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:885, in pyarrow.lib._PandasConvertible.to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:5002, in pyarrow.lib.Table._to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:784, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    781 columns = _deserialize_column_index(table, all_columns, column_indexes)
    783 column_names = table.column_names
--> 784 result = pa.lib.table_to_blocks(options, table, categories,
    785                                 list(ext_columns_dtypes.keys()))
    786 if _pandas_api.is_ge_v3():
    787     from pandas.api.internals import create_dataframe_from_blocks

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:3941, in pyarrow.lib.table_to_blocks()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483657

Tested on macOS Sonoma 14.5; the errors also occurred on Linux servers.

It seems from_dataframe avoids the error by using the 'large_string' datatype. However, we find that from_dataframe performs significantly worse than from_pandas in most cases and would therefore like to avoid it. Additionally, the large_string datatype seems to be lost on reload.

Is there already a way to reliably avoid the TypeError and ArrowCapacityError with the optimized pandas conversion methods, and is this a bug that could be fixed in future versions?

Component(s)

Python

jorisvandenbossche commented 1 week ago

The failure happens when the total number of characters in the categories exceeds the maximum value of a signed 32-bit integer (np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647), indicating it may be an int32 overflow issue. .. It seems from_dataframe avoids the error by using the 'large_string' datatype

It's indeed related to that. A single Array with the string type can only hold a limited total number of characters for all elements combined, because it uses int32 offsets. The large_string type on the other hand uses int64 offsets (spec).
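
For illustration, a minimal sketch (not from the thread) showing the difference in offset width for two small arrays:

import pyarrow as pa

small = pa.array(["a", "bb", "ccc"], type=pa.string())        # int32 offsets
large = pa.array(["a", "bb", "ccc"], type=pa.large_string())  # int64 offsets

# buffers() returns [validity, offsets, data]; n values need n + 1 offsets
print(small.buffers()[1].size)  # 16 bytes -> 4 offsets * 4 bytes each
print(large.buffers()[1].size)  # 32 bytes -> 4 offsets * 8 bytes each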

The problem here is that when we are converting the pandas Categorical column, we convert the integer codes and the actual categories (the unique values) separately to a pyarrow array. And when converting the categories, we bump into the issue that they do not fit into a single array. At that point the pa.array(..) function will automatically fall back to returning a chunked array:

# using your above df
>>> values = df["float_gran"].array
>>> pa.array(values.categories.values)
<pyarrow.lib.ChunkedArray object at 0x7f607b87c520>
[
  [
    "0.00010000548144684096",
    "0.00010002117808627364",
    ...
    "0.9792001085756353",
    "0.9792001280159454"
  ],
  [
    "0.9792001297798442",
    "0.9792001326304284",
    ...
    "9.997302630371241e-05",
    "9.999832524965058e-05"
  ]
]

What then causes the error is that we try to create the DictionaryArray using from_arrays, simplified roughly like this:

indices = pa.array(values.codes)
dictionary = pa.array(values.categories)
result = pa.DictionaryArray.from_arrays(indices, dictionary)

and this method cannot handle a ChunkedArray input; it expects two Arrays.

This is a problem in our implementation, though, and something we should fix.
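
For illustration only, a hypothetical sketch (not something proposed in this thread) of collapsing the chunked dictionary into a single Array by hand, assuming the column has no missing values:

values = df["float_gran"].array                  # pandas Categorical
indices = pa.array(values.codes)
dictionary = pa.array(values.categories.values)  # may come back as a ChunkedArray
if isinstance(dictionary, pa.ChunkedArray):
    # cast to large_string (int64 offsets) so all chunks fit into one Array
    dictionary = pa.concat_arrays(dictionary.cast(pa.large_string()).chunks)
result = pa.DictionaryArray.from_arrays(indices, dictionary)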


What you can do in the short term:

jorisvandenbossche commented 1 week ago

An example of specifying a schema with large_string for the conversion of the DataFrame:

>>> schema = pa.schema([("float_gran", pa.dictionary(pa.int64(), pa.large_string()))])
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
float_gran: dictionary<values=large_string, indices=int64, ordered=0>
----
float_gran: [  -- dictionary:
["0.00010000548144684096","0.00010002117808627364","0.00010002197545089242","0.00010004332387836268","0.00010004725275269966",...,"9.996006809298574e-05","9.996161281367044e-05","9.996313487481423e-05","9.997302630371241e-05","9.999832524965058e-05"]  -- indices:
[73773280,42540778,53064062,56325053,949787,...,48597326,570806,117918239,71806143,102880880]]

Of course the above covers only a single column; the annoying part is that at the moment you have to specify the full schema, i.e. also for all the other columns where the type inference would be fine.
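
A hedged sketch of one way around that (not from the thread): start from pyarrow's own inferred schema and override only the problematic field, so all other columns keep their inferred types:

inferred = pa.Schema.from_pandas(df)
idx = inferred.get_field_index("float_gran")
# only float_gran is overridden; everything else stays as inferred
schema = inferred.set(
    idx, pa.field("float_gran", pa.dictionary(pa.int64(), pa.large_string())))
table = pa.Table.from_pandas(df, schema=schema)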

lukaswenzl-akur8 commented 1 week ago

Thanks for your quick answer and insights! You are right that this is a rare, extreme edge case, but we want to avoid crashes.

For now we could use the workaround of converting to plain strings; schematically:

if np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647:
    df["float_gran"] = df["float_gran"].astype(str)
table = pa.Table.from_pandas(df)  # ... works
# restore the categorical dtype after converting back
table.to_pandas().astype("category")

This comes with a large performance penalty for the conversions, but at least it doesn't crash and only affects the edge case. Building the whole schema each time could be error-prone for our more general use case.

It is great to know that the upcoming pandas version may solve this. We will retest with pandas 3.0!