lukaswenzl-akur8 opened 1 week ago
The failure happens when the total number of characters reaches the maximum of a signed 32-bit integer (np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647), indicating it may be an int32 overflow issue. [...] It seems from_dataframe avoids the error by leveraging a 'large_string' datatype.
It's indeed related to that. A single Array with the string type can only hold a limited total number of characters for all elements combined, because it uses int32 offsets. The large_string type, on the other hand, uses int64 offsets (spec).
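For illustration, a minimal sketch of the offset-width difference between the two types (the offsets buffer stores one entry per element plus one; with int32 offsets the last entry, the total character count, cannot exceed 2_147_483_647):
import pyarrow as pa

data = ["ab", "cde"]
s = pa.array(data, type=pa.string())
ls = pa.array(data, type=pa.large_string())

# buffer layout of (large_)string arrays: [validity bitmap, offsets, character data]
print(s.buffers()[1].size)   # 12 -> three 4-byte int32 offsets
print(ls.buffers()[1].size)  # 24 -> three 8-byte int64 offsets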
The problem here is that when we are converting the pandas Categorical column, we convert the integer codes and the actual categories (the unique values) separately to a pyarrow array. And when converting the categories, we bump into the issue that they do not fit into a single array. At that point the pa.array(..) function will automatically fall back to returning a chunked array:
# using your above df
>>> values = df["float_gran"].array
>>> pa.array(values.categories.values)
<pyarrow.lib.ChunkedArray object at 0x7f607b87c520>
[
[
"0.00010000548144684096",
"0.00010002117808627364",
...
"0.9792001085756353",
"0.9792001280159454"
],
[
"0.9792001297798442",
"0.9792001326304284",
...
"9.997302630371241e-05",
"9.999832524965058e-05"
]
]
But what then causes the error is that we try to create the DictionaryArray using from_arrays, so simplified something like:
indices = pa.array(values.codes)           # integer codes -> pyarrow Array
dictionary = pa.array(values.categories)   # falls back to a ChunkedArray here
result = pa.DictionaryArray.from_arrays(indices, dictionary)  # fails
and this method cannot handle the ChunkedArray input; it expects two Arrays.
This is a problem in our implementation, though, and something we should fix.
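For illustration, one way this can be sidestepped manually at the array level (a sketch, not the fix pyarrow itself needs): requesting large_string up front keeps the categories in a single, non-chunked Array, which from_arrays then accepts:
# continuing from the REPL session above
dictionary = pa.array(values.categories.values, type=pa.large_string())  # single Array, int64 offsets
indices = pa.array(values.codes)
result = pa.DictionaryArray.from_arrays(indices, dictionary)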
What you can do in the short term: specify large_string for the resulting pyarrow dictionary type. This can be done through specifying a schema, although that is a bit inconvenient. And you are correct that right now this doesn't get preserved in a roundtrip (this will get solved in pandas 3.0, though, because then pandas will start using large_string by default on their side as well).
Example with specifying the schema with large_string for the conversion of the DataFrame:
>>> schema = pa.schema([("float_gran", pa.dictionary(pa.int64(), pa.large_string()))])
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
float_gran: dictionary<values=large_string, indices=int64, ordered=0>
----
float_gran: [ -- dictionary:
["0.00010000548144684096","0.00010002117808627364","0.00010002197545089242","0.00010004332387836268","0.00010004725275269966",...,"9.996006809298574e-05","9.996161281367044e-05","9.996313487481423e-05","9.997302630371241e-05","9.999832524965058e-05"] -- indices:
[73773280,42540778,53064062,56325053,949787,...,48597326,570806,117918239,71806143,102880880]]
Of course the above is only for a single column, and so the annoying part is that you have to specify the full schema at the moment, i.e. also for all other columns where the type inference would be fine; one way to soften that is sketched below.
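A possible convenience (an untested sketch): infer the schema from the DataFrame once, then swap in the large_string dictionary type only for the problematic column, keeping the inferred types for everything else:
import pyarrow as pa

inferred = pa.Schema.from_pandas(df)        # type inference only, no data conversion yet
idx = inferred.get_field_index("float_gran")
fixed = inferred.set(idx, pa.field("float_gran", pa.dictionary(pa.int64(), pa.large_string())))
table = pa.Table.from_pandas(df, schema=fixed)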
Thanks for your quick answer and insights! You are right that this is an extreme edge case that is rare, but we want to avoid crashes.
For now we could use the workaround of converting to strings, schematically:
if np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647:
    df["float_gran"] = df["float_gran"].astype(str)  # drop the categorical dtype
table = pa.Table.from_pandas(df)  # ...works
table.to_pandas().astype("category")  # restore the categorical on the way back
This comes with a large performance penalty for the conversions, but at least it doesn't crash and only affects the edge case. Building the whole schema each time could be prone to errors for our more general use case.
It is great to know that the upcoming pandas version may solve this. We will retest with pandas 3.0!
Describe the bug, including details regarding any error messages, version, and platform.
Converting from pandas to pyarrow with Table.from_pandas fails for dataframes with categorical columns that have large dictionaries. Similarly, loading such a column from a Parquet file and converting to pandas with Table.to_pandas() fails.
The failure happens when the total number of characters reaches the maximum of a signed 32-bit integer (np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647), indicating it may be an int32 overflow issue. Below is example code that reproduces the failure.
Note: the same error message appeared in issue #41936, but there the discussion was about RecordBatch, and it was noted that Table.from_pandas, used here, should work fine.
Tested on macOS Sonoma 14.5; errors also occurred on Linux servers.
It seems from_dataframe avoids the error by leveraging a 'large_string' datatype. However, we find the from_dataframe method to perform significantly worse than from_pandas in most cases and would therefore like to avoid using it. Additionally, the large_string datatype seems to be lost on reload.
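(For reference, a sketch of the interchange-protocol path referred to here, assuming pyarrow >= 11:)
from pyarrow.interchange import from_dataframe

table = from_dataframe(df)  # converts via the dataframe interchange protocol, using large_string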
Is there already a way to reliably avoid the TypeError and ArrowCapacityError in the optimized methods for pandas, and is this a bug that could be fixed in future versions?
Component(s)
Python