Aryik opened 6 days ago
Previous bug reports with the offset overflow are mostly around very large strings. In this case, we don't have any individual string that is larger than 2GB. Instead, we get the error when we are above a certain total size.
This is expected anyway. The binary and string types in Arrow store the offsets inside the data, so a string array with a total size greater than 2 GiB is not possible.
You should either keep the chunks separate (i.e. don't call combine_chunks) or first convert your string column to large_string (which uses 64-bit offsets).
The call to combine_chunks works when they are top-level string arrays and only fails when they're nested inside a struct. Is that expected?
Can you show the result of combining the top-level string arrays?
@Aryik if they are the same strings, then combining the chunks in the non-nested case should also error.
Could you provide a reproducible example for that? (and ideally one that doesn't require 150GB memory to create it, as the actual data should only need to be a bit bigger than 2GB to trigger the error).
Example showing an error in both cases:
import string
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
data = ["".join(np.random.choice(list(string.ascii_letters), n)) for n in np.random.randint(10, 500, size=10_000)]
# will create a chunked array because the data doesn't fit in a single array
chunked_array = pa.array(data * 1000)
nested_chunked_array = pc.make_struct(chunked_array, field_names=["string_field"])
With this example (where the string data > 2GB), I get the error both for the plain string type and for strings nested in a struct:
In [24]: nested_chunked_array.nbytes
Out[24]: 2596156000
In [25]: nested_chunked_array.combine_chunks()
...
ArrowInvalid: offset overflow while concatenating arrays
In [26]: chunked_array.combine_chunks()
...
ArrowInvalid: offset overflow while concatenating arrays, consider casting input from `string` to `large_string` first.
This isn't a reproducible example because it came from our production source data, but here's some evidence for the difference in behavior with nested vs. non-nested:
# additional_properties is a nested dict that contains strings
only_additional_properties = [
{"additional_properties": item["additional_properties"]} for item in test_items
]
unnested_additional_properties = [item["additional_properties"] for item in test_items]
nested_table, nested_chunks = get_arrow_table_and_chunks(
only_additional_properties, ["additional_properties"]
)
unnested_table, unnested_chunks = get_arrow_table_and_chunks(
unnested_additional_properties, unnested_additional_properties[0].keys()
)
nested_table.nbytes
# -> 3668663888
unnested_table.nbytes
# -> 3668663888
nested_table["additional_properties"].combine_chunks()
# ArrowInvalid: offset overflow ...
# This is one of the columns inside "additional_properties"
unnested_table["primaryimagesrc"].nbytes
# -> 349,666,429
# This works
unnested_table["primaryimagesrc"].combine_chunks()
I will try to come up with a reproducible example
Can you show the schema of both nested_table and unnested_table?
Also, unnested_table["primaryimagesrc"].nbytes shows a result that is less than 2GB, so in that case it can indeed fit in a single chunk. But that means the column contains different strings?
After trying some variations, I could not create a reproducible example, and I'm struggling to explain the behavior I saw with our production data.
Here's my attempt at repro, which passes without an exception:
import string
import numpy as np
import pyarrow as pa
data = [
"".join(np.random.choice(list(string.ascii_letters), n))
for n in np.random.randint(10, 500, size=10_000)
]
nested_table = pa.Table.from_pydict(
# 10x the nested data, because I thought the issue was related to the nesting and the unnested
# fails with data * 1000
{"nested": [{f"string_field_{i}": s for i in range(10)} for s in data * 100]}
)
unnested_table = pa.Table.from_pydict({"string_field": data * 100})
print(
    f"Trying to combine chunks of unnested table with nbytes {unnested_table['string_field'].nbytes}"
)
unnested_table["string_field"].combine_chunks()
print(
    f"Trying to combine chunks of nested table with nbytes {nested_table['nested'].nbytes}"
)
nested_table["nested"].combine_chunks()
In production, we have this function:
from typing import Any, Dict, List, cast

import polars as pl
import pyarrow as pa

def dataframe_from_dicts(
dicts: List[Dict[str, Any]],
schema: HtSchemaDict | List[str],
) -> pl.DataFrame:
"""Convert a list of dictionaries into a DataFrame with the specified schema.
Args:
dicts (List[dict]): The list of dictionaries to convert to a DataFrame.
schema (HtSchemaDict or List[str]): The schema of the DataFrame. This supports either a dictionary of the schema
or a list of the schema keys if we don't know all the schema values and the values will be inferred.
Returns:
pl.DataFrame: The DataFrame.
"""
schema_keys = list(schema.keys() if isinstance(schema, dict) else schema)
# Convert to list of tuples. We need to provide a None value for any missing keys in the schema
# so the schema will fill in the missing values with nulls in our model transformations.
normalized_data = [
{key: row.get(key, None) for key in schema_keys} for row in dicts
]
# Convert to Arrow Table
arrow_table = pa.Table.from_pydict(
{key: [row[key] for row in normalized_data] for key in schema_keys}
)
# Provide chunks of the Arrow Table to polars. If we have too much data in a single chunk,
# we get these strange errors:
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
arrow_chunks = arrow_table.to_batches(max_chunksize=10000)
# Create Polars DataFrame from Arrow Table
return cast(
pl.DataFrame,
# Allow passing in an empty schema so that it can be inferred.
pl.from_arrow(arrow_chunks, schema=schema), # Throwing an offset error
)
When I tried to call this on the nested data, it failed:
dataframe_from_dicts(only_additional_properties, ["additional_properties"])
# ArrowInvalid: offset overflow while concatenating arrays
When I called it on the unnested data, it worked and returned a dataframe. This is the thing I can't wrap my head around. If the issue is that one of the string columns is too large, how could it work here?
dataframe_from_dicts(
unnested_additional_properties, unnested_additional_properties[0].keys()
)
# -> returns a polars DataFrame
Here are the schemas I had before. Nested:
Arrow Table schema: additional_properties: struct<availableforsale: string, description: string, handle: string, onlinestoreurl: string, primaryimageoriginalsrc: string, primaryimagesrc: string, primaryimagetransformedsrc: string, title: string>
child 0, availableforsale: string
child 1, description: string
child 2, handle: string
child 3, onlinestoreurl: string
child 4, primaryimageoriginalsrc: string
child 5, primaryimagesrc: string
child 6, primaryimagetransformedsrc: string
child 7, title: string
Unnested:
Arrow Table schema: title: string
primaryimagesrc: string
onlinestoreurl: string
primaryimageoriginalsrc: string
handle: string
availableforsale: string
primaryimagetransformedsrc: string
description: string
Yes, you are correct that the column contains different strings in the production dataset. However, only_additional_properties and unnested_additional_properties come from the exact same source data.
When I called it on the unnested data, it works and returns a dataframe. This is the thing I can't wrap my head around. If the issue is that one of the string columns is too large, how could it work here?
Because the nested data is different from the unnested data? That's the only possible reason I can think of, and from the .nbytes values you showed in an earlier comment, it is clear that this is the case?
BTW, I think you should be able to avoid the conversion of the list of row dicts to a normalized dict just to use Table.from_pydict. Table.from_pylist accepts a list of row dicts (which I think is what you start with?), and I think it should be able to handle missing keys as well.
Describe the bug, including details regarding any error messages, version, and platform.
I was getting the following error when trying to build a large polars DataFrame from a pyarrow table:
After some digging, I discovered that this error only occurs when you call
combine_chunks
on an array that is a struct of strings, when the total size of the array is above some number of bytes. In my testing, I saw the error occur somewhere around 3,668,663,880 bytes. Here is a minimal reproduction:
When I called combine_chunks on the same size of data without the nesting of the struct, I did not get the error. I was able to reproduce this on macOS Sonoma 14.6.1 as well as the python:3.12.3-slim Docker image, which is based on Debian 12.
Python version: 3.12.2
PyArrow version: 17.0.0
Polars version: 1.7.1
The error appears to be thrown from PutOffsets via Concatenate in concatenate.cc:
https://github.com/apache/arrow/blob/apache-arrow-17.0.0/cpp/src/arrow/array/concatenate.cc#L166
Component(s)
C++, Python