apache / arrow


[Python] Pretty printing very large ChunkedArray objects can use unbounded memory #20692

Open · asfimport opened this issue 5 years ago

asfimport commented 5 years ago

In working on ARROW-2970, I have the following dataset:


import numpy as np
import pyarrow as pa

# 2049 binary values: one 1-byte value plus 2048 values of 1MB each (~2GB total)
values = [b'x'] + [
    b'x' * (1 << 20)
] * 2 * (1 << 10)

arr = np.array(values)

arrow_arr = pa.array(arr)

The resulting arrow_arr is a ChunkedArray with 129 chunks, each element of which is 1MB of binary data. The repr of this object is over 600MB:


In [10]: rep = repr(arrow_arr)

In [11]: len(rep)
Out[11]: 637536258

There are probably a number of failsafes we can implement to avoid badness in these pathological cases (which may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this).
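In the meantime, a caller-side workaround is to bound what repr ever has to format. A minimal sketch, assuming the arrow_arr built above (preview is just a local name):

# Workaround sketch, not a proposed fix: take a slice before printing so the
# repr only formats a handful of the 1MB elements. Arrow slices are zero-copy,
# so this does not duplicate the underlying buffers.
preview = arrow_arr.slice(0, 5)   # ChunkedArray.slice(offset, length)
print(preview)                    # formats at most 5 elements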

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-4099. Please see the migration documentation for further details.

asfimport commented 5 years ago

Wes McKinney / @wesm: What we probably need to do is implement a global size bound on the output of PrettyPrint so that we bail out early when we hit a particular limit (e.g. around a megabyte or so). This would be a fairly significant refactor of src/arrow/pretty_print.cc, since many functions there write directly into std::ostream without any size book-keeping. This isn't causing enough of a user problem to require a fix right now.
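To make the proposal concrete, here is a minimal Python sketch of the same idea (the actual change would live in the C++ pretty_print.cc; CappedWriter, OutputLimitReached, and bounded_repr are hypothetical names for illustration, not Arrow API):

import io

SIZE_LIMIT = 1 << 20  # bail out after ~1MB of output, per the suggestion above

class OutputLimitReached(Exception):
    """Raised once the writer has spent its byte budget."""

class CappedWriter:
    # Tracks bytes written and refuses further output past the cap; this is
    # the size book-keeping that the current std::ostream writes lack.
    def __init__(self, limit):
        self.buf = io.StringIO()
        self.limit = limit
        self.written = 0

    def write(self, s):
        if self.written >= self.limit:
            raise OutputLimitReached()
        self.buf.write(s[:self.limit - self.written])
        self.written += len(s)

def bounded_repr(chunked_array, limit=SIZE_LIMIT):
    out = CappedWriter(limit)
    try:
        for i, chunk in enumerate(chunked_array.iterchunks()):
            out.write(f"Chunk {i} ({len(chunk)} elements):\n")
            for value in chunk:
                out.write(f"  {value}\n")
    except OutputLimitReached:
        out.buf.write("\n... (output truncated at size limit)")
    return out.buf.getvalue()

With the array above, bounded_repr(arrow_arr) stays at roughly SIZE_LIMIT characters instead of the 637MB repr, because formatting stops as soon as the budget is spent.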