
[Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash #44643

Open snakingfire opened 1 week ago

snakingfire commented 1 week ago

Describe the bug, including details regarding any error messages, version, and platform.

Related to https://github.com/apache/arrow/issues/44640

When attempting to convert a pandas dataframe that has a dict-typed column to a pyarrow table with a map column, the conversion fails once the dataframe and column are of sufficient size:

/.../arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder

This is immediately followed by SIGABRT and the process crashing.

When the dataframe is smaller, the conversion succeeds without error. See the reproduction code below: when dataframe_size is set to a small value (e.g. 1M rows) there is no error, but at a certain size (e.g. 10M rows) the crash occurs.

import random
import string

import numpy as np
import pandas as pd
import pyarrow

dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
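
For reference, when dataframe_size is lowered (e.g. to 1_000_000) the same call completes. A quick sanity check on the successful path (a minimal sketch, assuming the script above was run at the smaller size):

# Only reachable when the conversion succeeds (smaller dataframe_size)
print(table.schema.field("map_col").type)  # map<string, string>
print(table.num_rows)                      # 1000000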

Environment Details:

Component(s)

Python

pitrou commented 4 days ago

cc @jorisvandenbossche @raulcd

raulcd commented 4 hours ago

really weird:

item_builder_->length(): 19456153
key_builder_->length():  19456154

I'll have to debug a bunch to understand where this mismatch is coming from :)

pitrou commented 3 hours ago

Intuitively, I think what happens is that the item_builder_ overflows because it's a StringBuilder and we try to append more than 2 GiB to it. The converter logic then tries to finish the chunk and start another one, but the key and item builders are out of sync.
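
A back-of-the-envelope check against the repro above is consistent with that (a rough sketch; the 2 GiB figure is the int32 offset limit of Arrow's regular, non-large string layout):

rows = 10_000_000
avg_entries_per_row = (5 + 10) / 2  # random.randint(5, 10) in the repro
avg_value_len = (20 + 200) / 2      # random.randint(20, 200) in the repro

total_item_bytes = rows * avg_entries_per_row * avg_value_len
print(f"{total_item_bytes / 2**30:.2f} GiB")  # ~7.68 GiB, far past 2 GiB
# The same estimate at 1M rows gives ~0.77 GiB, under the limit, which
# matches the report that smaller dataframes convert without error.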

pitrou commented 3 hours ago

It looks like the rewind-on-overflow logic in arrow/util/converter.h is too naive. In particular, if appending to one of a StructBuilder's child builders raises CapacityError, then all child builders should be rewound to the same length to ensure consistency.
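
A toy model of that invariant (illustrative Python only, not the actual C++ code in arrow/util/converter.h; ToyBuilder and append_map_entry are hypothetical names):

class ToyBuilder:
    """Stand-in for an Arrow array builder with a hard capacity limit."""
    def __init__(self, capacity):
        self.values = []
        self.capacity = capacity

    def append(self, v):
        if len(self.values) >= self.capacity:
            raise MemoryError("capacity reached")  # stand-in for arrow::CapacityError
        self.values.append(v)

    def rewind(self, length):
        # Drop anything appended past `length`.
        del self.values[length:]

def append_map_entry(key_builder, item_builder, key, value):
    start = len(key_builder.values)  # common length before this entry
    try:
        key_builder.append(key)
        item_builder.append(value)  # can overflow after the key already went in
    except MemoryError:
        # Rewind *all* child builders to the pre-append length so the key
        # and item builders never end up one element apart, then let the
        # caller finish this chunk and retry the entry in a fresh one.
        key_builder.rewind(start)
        item_builder.rewind(start)
        raise

# The naive variant rewinds nothing (or only the failed builder), leaving
# key_builder one element longer than item_builder -- the exact off-by-one
# seen in the Check failed message above.
keys, items = ToyBuilder(capacity=10), ToyBuilder(capacity=3)
for k, v in [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]:
    try:
        append_map_entry(keys, items, k, v)
    except MemoryError:
        pass
assert len(keys.values) == len(items.values)  # invariant holds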