asfimport opened this issue 4 years ago
Wes McKinney / @wesm: We truncate the buffers on sliced arrays when writing record batches to the IPC protocol, so the buffers should be similarly truncated in the case of pickling.
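A minimal sketch of the behaviour under discussion (assuming an affected pyarrow version; exact sizes will vary): pickling a one-element slice still carries the full parent buffers, while an explicit copy does not.
----
import pickle
import pyarrow as pa

ar = pa.array(range(1_000_000), type=pa.int64())   # ~8 MB of int64 data
small = ar.slice(10, 1)                            # a 1-element view on the same buffers

print(len(pickle.dumps(ar)))                       # on the order of 8 MB
print(len(pickle.dumps(small)))                    # also ~8 MB: the parent buffers get pickled
print(len(pickle.dumps(pa.concat_arrays([small]))))  # a copy shrinks this to a few hundred bytes
----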
Maarten Breddels / @maartenbreddels: Ok, good to know.
Two workarounds I came up with:
%%timeit
s = pa.serialize(ar.slice(10, 1))
ar2 = pa.deserialize(s.to_buffer())
790 µs ± 578 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import vaex.arrow.convert
----
def trim_buffers(ar):
    '''
    >>> ar = pa.array([1, 2, 3, 4], pa.int8())
    >>> ar.nbytes
    4
    >>> ar.slice(2, 2)  # doctest: +ELLIPSIS
    <pyarrow.lib.Int8Array object at 0x...>
    [
      3,
      4
    ]
    >>> ar.slice(2, 2).nbytes
    4
    >>> trim_buffers(ar.slice(2, 2)).nbytes
    2
    >>> trim_buffers(ar.slice(2, 2))  # doctest: +ELLIPSIS
    <pyarrow.lib.Int8Array object at 0x...>
    [
      3,
      4
    ]
    '''
    schema = pa.schema({'x': ar.type})
    with pa.BufferOutputStream() as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            writer.write_table(pa.table({'x': ar}))
    with pa.BufferReader(sink.getvalue()) as source:
        with pa.ipc.open_stream(source) as reader:
            table = reader.read_all()
    assert table.num_columns == 1
    assert table.num_rows == len(ar)
    trimmed_ar = table.column(0)
    if isinstance(trimmed_ar, pa.ChunkedArray):
        assert len(trimmed_ar.chunks) == 1
        trimmed_ar = trimmed_ar.chunks[0]
    return trimmed_ar
----
%%timeit
vaex.arrow.convert.trim_buffers(ar.slice(10, 1))
202 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Joris Van den Bossche / @jorisvandenbossche: Note that pyarrow.serialize is deprecated, so best not use that as a workaround.
Maarten Breddels / @maartenbreddels: Thanks Joris!
I cannot reproduce the previous timings (I guess I had a debug install without optimization), but this one seems fastest:
%%timeit
pa.concat_arrays([ar.slice(10, 1)])
2.16 µs ± 9.22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
(vs 8 µs and 125 µs using IPC and (de)serialize respectively)
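As a quick check (a sketch, with sizes that may vary by pyarrow version and buffer padding) that the concat_arrays copy really drops the untouched part of the parent buffers:
----
import pyarrow as pa

ar = pa.array(range(1000), pa.int64())
sliced = ar.slice(10, 1)
copied = pa.concat_arrays([sliced])

# The slice still references the full 8000-byte parent data buffer;
# the copy only keeps the bytes for the selected element (plus any padding).
print([b.size for b in sliced.buffers() if b is not None])
print([b.size for b in copied.buffers() if b is not None])
----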
Wes McKinney / @wesm: Seems like this would be good to fix?
Joris Van den Bossche / @jorisvandenbossche: That would indeed be good.
For a moment, I naively thought this would just be adding a SliceBuffer call when wrapping the buffer in _reduce_array_data (https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/python/pyarrow/array.pxi#L597). But of course, the offset and length to slice the buffer with depend on the array type and bit width, or whether it's a bitmap, etc.
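To illustrate that point, here is a small sketch (naive_data_byte_range is a made-up helper, not pyarrow internals): the byte range to keep depends on the value width, and bit-packed buffers such as validity bitmaps or boolean data cannot in general be sliced on a byte boundary at all.
----
import pyarrow as pa

def naive_data_byte_range(arr):
    # Only meaningful for fixed-width, byte-aligned types; validity bitmaps,
    # booleans and variable-width types (strings, lists, ...) need different logic.
    width_bits = arr.type.bit_width
    if width_bits % 8 != 0:
        raise NotImplementedError("bit-packed buffers need bit-level handling")
    width = width_bits // 8
    return arr.offset * width, len(arr) * width

print(naive_data_byte_range(pa.array([1, 2, 3, 4], pa.int8()).slice(2, 2)))   # (2, 2)
print(naive_data_byte_range(pa.array([1, 2, 3, 4], pa.int64()).slice(2, 2)))  # (16, 16)
----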
In the IPC code, the truncation is handled in the Visit methods of RecordBatchSerializer (e.g. for primitive arrays: https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/cpp/src/arrow/ipc/writer.cc#L331), and this is quite a lot of code for doing this correctly for all the different data types. Something we shouldn't start replicating in Cython, I think.
Are there other utilities in C++ that can be reused to do this truncation? Or could we "just" use the IPC serialization under the hood for pickling in Python?
Jim Crist-Harif:
We're running into this in Dask right now when attempting to integrate Pandas string[pyarrow], since pickling pyarrow string arrays ends up pickling all the data even if the result only includes a small slice. I'm willing to hack on this if no one else has the bandwidth, but upon initial inspection this looks a bit more complicated than I'd like to bite off as a new-ish arrow contributor. With some guidance on the best path forward, though, I could possibly get something working. @jorisvandenbossche any further thoughts on a solution here?
Krisztian Szucs / @kszucs: Postponing to 10.0 since there is no PR available at the moment.
Philipp Moritz / @pcmoritz: In Ray we are also planning to work around this https://github.com/ray-project/ray/pull/22891 – it would be wonderful to see this fixed in Arrow :)
Clark Zinzow: Hey folks, I'm the author of the Ray PR that @pcmoritz linked to, which essentially ports Arrow's buffer truncation in the IPC serialization path to Python as a custom pickle serializer. I'd be happy to help push on getting this fixed upstream for Arrow 10.0.0.
First, is there any in-progress work by [~jcrist] or others?
If not, I could take this on in the next month or so; the two implementation routes that I've thought of when looking at the IPC serialization code (these are basically the same routes that @jorisvandenbossche pointed out a year ago) are:
1. Refactor the IPC writer's per-type buffer truncation logic into utilities that can be shared by the IPC serialization path and the pickle serialization path.
2. Directly use the Arrow IPC format in the pickle serialization, where the pickle reducer is a light wrapper around the IPC serialization and deserialization hooks.
Do either of these routes sound appealing? (2) has the added benefits of consolidating the serialization schemes on the IPC format and pushing all expensive serialization code into C++ land, but it is a larger change and would involve otherwise-unnecessary wrapping of plain Arrow (chunked) arrays in record batches in order to match the IPC format, so maybe (1) is the better option.
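For concreteness, a rough Python-level sketch of what route (2) could look like (the helper names _reduce_array_via_ipc and _restore_array_via_ipc are hypothetical; an actual implementation would live in the Cython __reduce__ methods): the reducer round-trips the array through the IPC stream writer, which already performs the buffer truncation.
----
import pyarrow as pa

def _reduce_array_via_ipc(arr):
    # Wrap the (possibly sliced) array in a one-column record batch; the IPC
    # writer truncates the buffers, at the cost of some fixed wrapping overhead.
    batch = pa.record_batch([arr], names=["f0"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return _restore_array_via_ipc, (sink.getvalue(),)

def _restore_array_via_ipc(buf):
    with pa.ipc.open_stream(buf) as reader:
        return reader.read_next_batch().column(0)

# Example: the reduced payload for a 1-element slice stays small.
ar = pa.array(range(1_000_000), pa.int64())
_, (payload,) = _reduce_array_via_ipc(ar.slice(10, 1))
print(payload.size)  # a few hundred bytes instead of ~8 MB
----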
Clark Zinzow: Ping on this, @amol- [~jcrist], are either of y'all actively working on this for Arrow 10.0.0? And if not, does option (1) that I gave in my previous comment sound like a good path forward?
Joris Van den Bossche / @jorisvandenbossche: [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome! I think option (1) is a good path forward.
Clark Zinzow: @jorisvandenbossche No worries! And sounds good, I should be able to start working on this in a few weeks, I will update this thread once I've started working on it and again once I have a PR out.
Joris Van den Bossche / @jorisvandenbossche: That sounds good!
Joris Van den Bossche / @jorisvandenbossche: [~clarkzinzow] were you able to make some progress on this?
(I am removing the "In progress" label, just to make it a bit easier to keep track of this as open issue in JIRA, but can change that again once there is an open PR)
Clark Zinzow: @jorisvandenbossche I did a quick implementation of (2), where the Arrow IPC format is used under the hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the overhead of the RecordBatch wrapper adds ~230 extra bytes to the pickled payload (per Array chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (order-of-magnitude larger serialized payloads). We could sidestep this issue by having Table, RecordBatch, and ChunkedArray port their __reduce__ to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there will still be the baseline ~230 byte bloat for ChunkedArray and Array that we might find untenable.
I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact + we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat.
Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
Clark Zinzow: Apologies for the delay on this one! I have an implementation of (1) working e2e locally now, where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths, but I haven't had time to clean up the handling of dictionary arrays, which is currently a bit ugly. I'm hoping to get back to it this week and push up a PR.
@jorisvandenbossche Do you know what the 11.0.0 release timeline is?
@raulcd @kou @pitrou @jorisvandenbossche Do you have any insights into the situation here? This bug caused Ray to pin arrow <7 on Windows, which is putting us into a difficult spot in conda-forge, as we only support the last 4 major versions (which is already highly unusual in conda-forge, but owing to how quickly arrow progresses, to give projects a chance to catch up).
That means arrow 6 does not get migrated against new library releases anymore (e.g. abseil, grpc, protobuf, re2, etc.), and so we'd be making Ray essentially impossible to install with a current set of dependencies. By extension, this would become very painful for Ray users on Windows, as their environments become hard or impossible to resolve. For these reasons, I'm opposed to adding that cap in conda-forge, but of course we don't want to ship broken packages either.
CC @mattip @krfricke @clarkzinzow
@h-vetinari This is definitely an issue that deserves fixing, it just needs someone to work on it.
I am not sure this issue prevents Ray from updating. First, this issue is not causing a "broken" package; AFAIK it's only a performance issue (that can be worked around with a copy). Further, looking at the Ray PRs linked above, it seems they actually implemented a workaround on their side by using IPC to support Arrow 7+ (https://github.com/ray-project/ray/pull/29055). And the comment in Ray's setup.py mentions "Serialization workaround for pyarrow 7.0.0+ doesn't work for Windows.", so it mentions Windows, while this issue is not Windows-specific.
Of course we should still fix this issue, because it is a seriously annoying and unexpected behaviour of pickling arrow data, but I am not fully sure this is the blocker for updating Ray's pyarrow dependency on conda-forge's side.
Looking at the Ray PR I linked more closely, it indeed mentions after being merged that it was failing Windows CI; it was then reverted and added back later in steps, with this restriction on the pyarrow version just for Windows. But if the workaround to use IPC serialization instead of pickle doesn't work on Windows, it might be interesting to know why that is or which bug exactly causes it (it might be easier to fix).
> Apologies for the delay on this one! I have an implementation of (1) working e2e locally now, where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths, but I haven't had time to clean up the handling of dictionary arrays, which is currently a bit ugly. I'm hoping to get back to it this week and push up a PR.
@clarkzinzow apologies for the very slow reply on your earlier message about having a working implementation. If you would still have the time to create a PR for this, that's certainly welcome! (or push the branch somewhere, in case someone else might be able to finish it) In the meantime, @anjakefala is planning to make a PR for the simpler option (2), which shouldn't require too much custom code, but at least already allows exploring a fix.
Hi @jorisvandenbossche, apologies for the delays on my end! The application-level workaround for Ray ended up sufficing, so I never got back to submitting a PR, and I must have missed these notifications.
I have an out-of-date branch that works e2e for option (1), where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths: a BufferAggregator interface is introduced with an IPC implementation and an implementation that accumulates buffers into a payload (similar to the C data interface) that can be serialized with language-specific schemes by Arrow front-ends (e.g. pickle via Python).
This has a few key pros over option (2) that I believe make it a compelling choice:
- No wrapping of plain (chunked) arrays in a RecordBatch, which elides 230 redundant bytes per payload compared to option (2). For systems sending small chunks of Arrow data over the wire, this could be non-negligible.
But this still has a few TODOs:
- Move the BufferAggregator, ArrayBufferPayload, and ArraySerializerBufferAggregator definitions/implementations out of the IPC code; these could be top-level, under arrow/util, under a new directory, etc.
I don't think that I'll have the bandwidth to get this across the line in the near-term, but I think that most of the difficult work is done if someone else has the time! cc @anjakefala
I had an already started branch for Option (2) (directly using IPC serialization for pickling), and confirmed that it does not support Pickle Protocol 5. That made that approach untenable.
@pitrou Would you be able to take a look at @clarkzinzow's branch? It aims to be an implementation of Option (1):
Refactor the IPC writer's per-type buffer truncation logic into utilities that can be shared by the IPC serialization path and the pickle serialization path.
Is it a decent starting point? I would be happy to adopt it, and break it up into smaller PRs if it looks like a promising approach.
Hi @anjakefala, while I agree that Option (1) is a good approach, I'm a bit skeptical about the proposed implementation. Two general comments:
1. This should be a general utility function, not something in ipc.
2. I'm not sure what the point is of introducing an ArrayBufferPayload class that looks a lot like ArrayData. I would instead expect a function that takes an ArrayData (or Array) and returns a new ArrayData (or Array respectively, depending on which one is more convenient) with its buffers adjusted.
I might be missing something, though. Are there some concrete constraints mandating this particular implementation?
> 1. This should be a general utility function, not something in ipc.

Just to be fair, @clarkzinzow noted that as one of the remaining todos of his branch.

> 2. I'm not sure what the point is of introducing an ArrayBufferPayload class that looks a lot like ArrayData.

That is also not really clear to me. Above, the following brief explanation was given (https://github.com/apache/arrow/issues/26685#issuecomment-1703199976):

> a BufferAggregator interface is introduced with an IPC implementation and an implementation that accumulates buffers into a payload (similar to the C data interface) that can be serialized with language-specific schemes by Arrow front-ends (e.g. pickle via Python).

@clarkzinzow can you explain a bit your rationale of going with this payload class?
> @clarkzinzow can you explain a bit your rationale of going with this payload class?
@jorisvandenbossche @pitrou IIRC I started with ArrayData as the payload aggregation class, but ended up wanting a parent pointer to make the visitor-based aggregation clean, so we could hop back to the parent payload at Seal() time. I thought that the payload class would end up deviating in other ways, but it ended up being very close to ArrayData, so I do think that we should find a way to make ArrayData work.

One option would be maintaining an explicit shared-pointer stack of in-progress ArrayDatas in the aggregator, which would obviate the need for the parent pointer:
- On WithArray, we create a new ArrayData, link it as the child of the current_, push the new one onto the stack, and set it to be current_,
- On Seal(), we pop the current_ ArrayData off the stack, and set current_ to be the new top of the stack (the parent).

With that change, I believe that ArrayData would suffice as the buffer aggregation payload.
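The stack-based bookkeeping described above, sketched as illustrative Python pseudocode (the real aggregator would be C++ working on ArrayData; the class name and dict-based payload here are just stand-ins):
----
class StackBufferAggregator:
    """Illustrative only: accumulate a tree of payloads without parent pointers."""

    def __init__(self):
        self._stack = []   # in-progress payload nodes, innermost last
        self.root = None

    def with_array(self, type_):
        node = {"type": type_, "buffers": [], "children": []}
        if self._stack:
            self._stack[-1]["children"].append(node)  # link as child of current_
        else:
            self.root = node
        self._stack.append(node)                      # the new node becomes current_

    def add_buffer(self, truncated_buffer):
        self._stack[-1]["buffers"].append(truncated_buffer)

    def seal(self):
        self._stack.pop()                             # the parent becomes current_ again
----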
> One option would be maintaining an explicit shared-pointer stack of in-progress ArrayDatas in the aggregator, which would obviate the need for the parent pointer
I have no idea why you would need that. I would expect this API to be a single function call and the underlying implementation is free to keep temporary state while moving recursively along the tree.
@pitrou The API would be a single function call. I'm referring to internal implementation details of the aggregator that's passed to the array visitor.
The branch I linked refactors the IPC serialization code into a generic array visitor that takes a BufferAggregator, of which there are two implementations: one for IPC and one for serializing to some payload that can be used by language frontends for their own serialization schemes. I'm saying that the latter can explicitly maintain a stack internally, which would allow us to use ArrayData as the payload.
The API exposed to language frontends would just be a single function call returning that payload (ArrayData).
I don't understand why you want a visitor API? Typically, slicing the buffers should be a cheap operation, so you can just do it all at once.
The visitor API isn't exposed to array serialization users, it's an (existing) implementation detail of the serialization. All that this PR does is factor out the IPC buffer aggregation so we can inject a different aggregator for frontend language serialization (pickle).
I am not talking about RecordBatchSerializer, but about a new API to adjust buffers inside an ArrayData.
Oh, that makes more sense! So you're saying that an API should be exposed on Array/ArrayData to recursively truncate buffers, rather than piggyback off of IPC's array visitor + "serialize to a payload" logic? That would probably be cleaner from a frontend language's perspective.
There's still array-type-specific buffer truncation logic that would require an array-type-based visitor, and it's less clear to me how to marry that array visitor with the IPC serialization code in order to share that buffer truncation logic without making the IPC serialization code less efficient. Right now the IPC serialization truncates buffers and aggregates them into a flat list in a single recursive pass over the array tree, while factoring that buffer truncation logic out into an ArrayData.truncate_buffers() API would seem to necessitate two passes over the array tree: one to truncate buffers and one to aggregate the flat buffer list.
I think we can definitely accept two passes on the array tree. There are existing benchmarks that can validate this, so we can refine later.
Thank you @clarkzinzow and @pitrou! I will put forth a design proposal. =)
Hello @anjakefala! May I know the status of the proposal for this issue?
If a large array is sliced and pickled, it seems the full buffer is serialized; this leads to excessive memory usage and data transfer when using multiprocessing or dask.
I think this makes sense if you know arrow, but kind of unexpected as a user.
Is there a workaround for this? For instance copy an arrow array to get rid of the offset, and trim the buffers?
Reporter: Maarten Breddels / @maartenbreddels
Assignee: Clark Zinzow
Note: This issue was originally created as ARROW-10739. Please see the migration documentation for further details.