apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python][C++] Offset Overflow when calling combine_chunks on Large Struct Arrays #44164

Open Aryik opened 6 days ago

Aryik commented 6 days ago

Describe the bug, including details regarding any error messages, version, and platform.

I was getting the following error when trying to build a large polars DataFrame from a pyarrow table:

  File "/app/decision_engine/loader.py", line 897, in dataframe_from_dicts
    pl.from_arrow(arrow_chunks, schema=schema),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/convert/general.py", line 462, in from_arrow
    arrow_to_pydf(
  File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 1195, in arrow_to_pydf
    ps = plc.arrow_to_pyseries(name, column, rechunk=rechunk)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/series.py", line 421, in arrow_to_pyseries
    pys = PySeries.from_arrow(name, array.combine_chunks())
                                    ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 754, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 4579, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

After some digging, I discovered that this error only occurs when you call combine_chunks on an array that is a struct of strings and the total size of the array is above a certain number of bytes. In my testing, the error occurred somewhere around 3,668,663,880 bytes.

Previous bug reports with the offset overflow are mostly around very large strings. In this case, we don't have any individual string that is larger than 2GB. Instead, we get the error when we are above a certain total size. Here is a minimal reproduction:

from typing import Any, Dict, List, Tuple

import pyarrow as pa


def get_arrow_table_and_chunks(
    dicts: List[Dict[str, Any]],
    schema_keys: List[str],
) -> Tuple[pa.Table, List[pa.RecordBatch]]:
    # Normalize each row so every schema key is present (missing keys become None)
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]

    # Convert to Arrow Table
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )

    print(f"Arrow Table schema: {arrow_table.schema}")

    # Provide chunks of the Arrow Table to polars. If we have too much data in a single chunk,
    # we get these strange errors:
    # pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)

    return arrow_table, arrow_chunks

n_bytes_per_row = 14
desired_bytes = 3_668_663_888
n_rows = desired_bytes // n_bytes_per_row
placeholder_string = "1"*10
# Careful - this uses ~150GB of memory and takes a long time
data = [ { "i": placeholder_string, "nested": { "nested_string": placeholder_string } } for _ in range(n_rows) ]
table, chunks = get_arrow_table_and_chunks(data, ["i", "nested"])
# Output:
# Arrow Table schema: i: string
# nested: struct<nested_string: string>
#   child 0, nested_string: string
table["nested"].nbytes
# Output: 3,668,663,880
table["nested"].combine_chunks()
# ---------------------------------------------------------------------------
# ArrowInvalid                              Traceback (most recent call last)
# Cell In[12], line 1
# ----> 1 table["nested"].combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/table.pxi:754, in pyarrow.lib.ChunkedArray.combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/array.pxi:4579, in pyarrow.lib.concat_arrays()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
# ArrowInvalid: offset overflow while concatenating arrays

When I called combine_chunks on the same amount of data without the struct nesting, I did not get the error. I was able to reproduce this on macOS Sonoma 14.6.1 as well as on the python:3.12.3-slim Docker image, which is based on Debian 12.

Python version: 3.12.2, PyArrow version: 17.0.0, Polars version: 1.7.1

The error appears to be thrown from PutOffsets via Concatenate in concatenate.cc https://github.com/apache/arrow/blob/apache-arrow-17.0.0/cpp/src/arrow/array/concatenate.cc#L166

Component(s)

C++, Python

pitrou commented 5 days ago

Previous bug reports with the offset overflow are mostly around very large strings. In this case, we don't have any individual string that is larger than 2GB. Instead, we get the error when we are above a certain total size.

This is expected anyway. The binary and string types in Arrow store 32-bit offsets inside the data, so a string array with a total size greater than 2 GiB is not possible.

You should either keep the chunks separate (i.e. don't call combine_chunks) or first convert your string columns to large_string (which uses 64-bit offsets).
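
To illustrate the second workaround, here is a minimal sketch (not from the thread; the toy chunked column and field name are assumptions) of casting a struct-of-string column to struct-of-large_string before combining:

import pyarrow as pa

# Toy chunked struct<nested_string: string> column standing in for table["nested"]
chunked = pa.chunked_array([
    pa.array([{"nested_string": "1" * 10}] * 3),
    pa.array([{"nested_string": "1" * 10}] * 3),
])

# Cast the string child to large_string (64-bit offsets), then combine.
# Recent pyarrow versions apply struct-to-struct casts per child field.
large_type = pa.struct([pa.field("nested_string", pa.large_string())])
combined = chunked.cast(large_type).combine_chunks()
print(combined.type)  # struct<nested_string: large_string>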

Aryik commented 5 days ago

The call to combine_chunks works when they are top-level string arrays and only fails when they're nested inside a struct. Is that expected?

pitrou commented 5 days ago

Can you show the result of combining the top-level string arrays?

jorisvandenbossche commented 4 days ago

@Aryik if they are the same strings, then combining the chunks in the non-nested case should also error.

Could you provide a reproducible example for that? (And ideally one that doesn't require 150GB of memory to create, as the actual data only needs to be a bit bigger than 2GB to trigger the error.)

Example showing an error in both cases:

import string
import numpy as np

import pyarrow as pa
import pyarrow.compute as pc

data = ["".join(np.random.choice(list(string.ascii_letters), n)) for n in np.random.randint(10, 500, size=10_000)]

# will create a chunked array because the data doesn't fit in a single array
chunked_array = pa.array(data * 1000)
nested_chunked_array = pc.make_struct(chunked_array, field_names=["string_field"])

With this example (where the string data > 2GB), I get the error both for the plain string type and for it nested in a struct:

In [24]: nested_chunked_array.nbytes
Out[24]: 2596156000

In [25]: nested_chunked_array.combine_chunks()
...
ArrowInvalid: offset overflow while concatenating arrays

In [26]: chunked_array.combine_chunks()
...
ArrowInvalid: offset overflow while concatenating arrays, consider casting input from `string` to `large_string` first.

Aryik commented 4 days ago

This isn't a reproducible example because it came from our production source data, but here's some evidence for the difference in behavior with nested vs. non-nested:

# additional_properties is a nested dict that contains strings
only_additional_properties = [
    {"additional_properties": item["additional_properties"]} for item in test_items
]
unnested_additional_properties = [item["additional_properties"] for item in test_items]
nested_table, nested_chunks = get_arrow_table_and_chunks(
    only_additional_properties, ["additional_properties"]
)
unnested_table, unnested_chunks = get_arrow_table_and_chunks(
    unnested_additional_properties, unnested_additional_properties[0].keys()
)
nested_table.nbytes
# -> 3668663888
unnested_table.nbytes
# -> 3668663888
nested_table["additional_properties"].combine_chunks()
# ArrowInvalid: offset overflow ...

# This is one of the columns inside "additional_properties"
unnested_table["primaryimagesrc"].nbytes  
# -> 349,666,429

# This works 
unnested_table["primaryimagesrc"].combine_chunks()

I will try to come up with a reproducible example.

jorisvandenbossche commented 4 days ago

Can you show the schema of both nested_table and unnested_table?

jorisvandenbossche commented 4 days ago

Also, unnested_table["primaryimagesrc"].nbytes shows a result that is less than 2GB, so in that case it can indeed fit in a single chunk. But that means that column contains different strings?

Aryik commented 3 days ago

After trying some variations, I cannot create a reproducible example, but I'm struggling to explain the behavior I saw with our production data.

Here's my attempt at repro, which passes without an exception:

import string

import numpy as np
import pyarrow as pa

data = [
    "".join(np.random.choice(list(string.ascii_letters), n))
    for n in np.random.randint(10, 500, size=10_000)
]

nested_table = pa.Table.from_pydict(
    # 10x the data inside the struct (10 string fields per row), because I thought the issue was
    # related to the nesting, and the unnested case already fails with data * 1000
    {"nested": [{f"string_field_{i}": s for i in range(10)} for s in data * 100]}
)

unnested_table = pa.Table.from_pydict({"string_field": data * 100})

print(
    f"Trying to combine chunks of unnested table with nbytes {unnested_table['string_field'].nbytes}"
)
unnested_table["string_field"].combine_chunks()

print(
    f"Trying to combine chunks of nested table with nbytes {nested_table['nested'].nbytes}"
)
nested_table["nested"].combine_chunks()

In production, we have this function:

def dataframe_from_dicts(
    dicts: List[Dict[str, Any]],
    schema: HtSchemaDict | List[str],
) -> pl.DataFrame:
    """Convert a list of dictionaries into a DataFrame with the specified schema.
    Args:
        dicts (List[dict]): The list of dictionaries to convert to a DataFrame.
        schema (HtSchemaDict or List[str]): The schema of the DataFrame. This supports either a dictionary of the schema
        or a list of the schema keys if we don't know all the schema values and the values will be inferred.
    Returns:
        pl.DataFrame: The DataFrame.
    """

    schema_keys = list(schema.keys() if isinstance(schema, dict) else schema)

    # Normalize each row dict to the schema keys. We need to provide a None value for any missing keys in the schema
    # so the schema will fill in the missing values with nulls in our model transformations.
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]

    # Convert to Arrow Table
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )

    # Provide chunks of the Arrow Table to polars. If we have too much data in a single chunk,
    # we get these strange errors:
    # pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)

    # Create Polars DataFrame from Arrow Table
    return cast(
        pl.DataFrame,
        # Allow passing in an empty schema so that it can be inferred.
        pl.from_arrow(arrow_chunks, schema=schema),  # Throwing an offset error
    )

When I tried to call this on the nested data, it failed:

dataframe_from_dicts(only_additional_properties, ["additional_properties"])
# ArrowInvalid: offset overflow while concatenating arrays

When I called it on the unnested data, it worked and returned a dataframe. This is the thing I can't wrap my head around: if the issue is that one of the string columns is too large, how could it work here?

dataframe_from_dicts(
    unnested_additional_properties, unnested_additional_properties[0].keys()
)
# -> returns a polars DataFrame

Here are the schemas I had before. Nested:

Arrow Table schema: additional_properties: struct<availableforsale: string, description: string, handle: string, onlinestoreurl: string, primaryimageoriginalsrc: string, primaryimagesrc: string, primaryimagetransformedsrc: string, title: string>
  child 0, availableforsale: string
  child 1, description: string
  child 2, handle: string
  child 3, onlinestoreurl: string
  child 4, primaryimageoriginalsrc: string
  child 5, primaryimagesrc: string
  child 6, primaryimagetransformedsrc: string
  child 7, title: string

Unnested:

Arrow Table schema: title: string
primaryimagesrc: string
onlinestoreurl: string
primaryimageoriginalsrc: string
handle: string
availableforsale: string
primaryimagetransformedsrc: string
description: string

Yes, you are correct that the column contains different strings in the production dataset. However, only_additional_properties and unnested_additional_properties come from the exact same source data.

jorisvandenbossche commented 1 day ago

When I called it on the unnested data, it works and returns a dataframe. This is the thing I can't wrap my head around. If the issue is that one of the string columns is too large, how could it work here?

Because the nested data is different from the unnested data? That's the only possible explanation I can think of, and from the .nbytes you showed in an earlier comment, it seems clear that this is the case?
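
For reference, here is a quick diagnostic sketch (an assumption, not something run in this thread) for checking whether any individual child of the struct column exceeds the 32-bit offset range; it assumes the nested_table from the earlier comments is in scope:

# flatten() turns the struct ChunkedArray into one ChunkedArray per child field
col = nested_table["additional_properties"]
for field, child in zip(col.type, col.flatten()):
    over_limit = child.nbytes > 2**31 - 1  # rough check against the 2 GiB offset range
    print(field.name, child.nbytes, "exceeds 32-bit offset range" if over_limit else "ok")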

BTW, I think you should be able to avoid the round-trip from a list of row dicts to a dict of columns just to use Table.from_pydict. Table.from_pylist accepts a list of row dicts (which I think is what you start with?), and I think it should be able to handle missing keys as well.
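
A minimal sketch of that suggestion (toy rows and schema, assumptions rather than the production data), where rows with missing keys come out as nulls:

import pyarrow as pa

rows = [
    {"i": "a", "nested": {"nested_string": "x"}},
    {"i": "b"},  # "nested" is missing and should come out as null
]
schema = pa.schema([
    ("i", pa.string()),
    ("nested", pa.struct([("nested_string", pa.string())])),
])

table = pa.Table.from_pylist(rows, schema=schema)
print(table.to_pydict())
# expected: {'i': ['a', 'b'], 'nested': [{'nested_string': 'x'}, None]}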