apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.63k stars 3.56k forks source link

[IPC] Can we serialize just arrays or scalars? #44736

Open JerAguilon opened 1 week ago

JerAguilon commented 1 week ago

Describe the usage question you have. Please include as many useful details as possible.

I have a use case where I want to send a fairly small arrays/scalars via Arrow IPC. Memory efficiency is quite important for us.

One simple way of doing this would be to put the ints in a Record batch and serialize.

However, I notice that serializing an int, 3 chars, and a bool takes a whopping 512 bytes on my machine:

https://gist.github.com/JerAguilon/8c70107a576c0b04b5846ad8207651e7

STDOUT:
batch: pyarrow.RecordBatch
a: int64
b: string
c: bool, num_rows: 1
buf.size: 512

Things like the schema are well known ahead of time, I'd like to minimize as much data in the payload as possible.

In Arrow IPC (the C++ library works for me), is there some way to serialize just a scalar or array? I see only a RecordBatchWriter exposed:

https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv416MakeStreamWriterPN2io12OutputStreamERKNSt10shared_ptrI6SchemaEERK15IpcWriteOptions

Component(s)

C++