apache / arrow


[Python] Pretty printing very large ChunkedArray objects can use unbounded memory #20692

Open · asfimport opened this issue 5 years ago

asfimport commented 5 years ago

In working on ARROW-2970, I have the following dataset:


import numpy as np
import pyarrow as pa

# 2049 binary values: one 1-byte value plus 2048 values of 1MB each (~2GB total)
values = [b'x'] + [
    b'x' * (1 << 20)
] * 2 * (1 << 10)

arr = np.array(values)

arrow_arr = pa.array(arr)

The resulting arrow_arr is a ChunkedArray with 129 chunks, each element of which is 1MB of binary data. The repr of this object is over 600MB:


In [10]: rep = repr(arrow_arr)

In [11]: len(rep)
Out[11]: 637536258

There are probably a number of failsafes we can implement to avoid badness in these pathological cases (which may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this).
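In the meantime, a caller-side workaround is to bound what repr ever has to format. A minimal sketch, assuming the arrow_arr built above (preview is just a local name):

# Workaround sketch, not a proposed fix: take a slice before printing so the
# repr only formats a handful of the 1MB elements. Arrow slices are zero-copy,
# so this does not duplicate the underlying buffers.
preview = arrow_arr.slice(0, 5)   # ChunkedArray.slice(offset, length)
print(preview)                    # formats at most 5 elements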

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-4099. Please see the migration documentation for further details.

asfimport commented 5 years ago

Wes McKinney / @wesm: What we probably need to do is implement a global size bound on the output of PrettyPrint so that we bail out early when we hit a particular limit (e.g. around a megabyte or so). This would be a fairly significant refactor of src/arrow/pretty_print.cc, since many functions there write directly into std::ostream without any size book-keeping. This isn't causing enough of a user problem to require a fix right now.
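To make the proposal concrete, here is a minimal Python sketch of the same idea (the actual change would live in the C++ pretty_print.cc; CappedWriter, OutputLimitReached, and bounded_repr are hypothetical names for illustration, not Arrow API):

import io

SIZE_LIMIT = 1 << 20  # bail out after ~1MB of output, per the suggestion above

class OutputLimitReached(Exception):
    """Raised once the writer has spent its byte budget."""

class CappedWriter:
    # Tracks bytes written and refuses further output past the cap; this is
    # the size book-keeping that the current std::ostream writes lack.
    def __init__(self, limit):
        self.buf = io.StringIO()
        self.limit = limit
        self.written = 0

    def write(self, s):
        if self.written >= self.limit:
            raise OutputLimitReached()
        self.buf.write(s[:self.limit - self.written])
        self.written += len(s)

def bounded_repr(chunked_array, limit=SIZE_LIMIT):
    out = CappedWriter(limit)
    try:
        for i, chunk in enumerate(chunked_array.iterchunks()):
            out.write(f"Chunk {i} ({len(chunk)} elements):\n")
            for value in chunk:
                out.write(f"  {value}\n")
    except OutputLimitReached:
        out.buf.write("\n... (output truncated at size limit)")
    return out.buf.getvalue()

With the array above, bounded_repr(arrow_arr) stays at roughly SIZE_LIMIT characters instead of the 637MB repr, because formatting stops as soon as the budget is spent.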