apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

RecordBatch `get_array_memory_size` returns incorrect size if underlying buffers are shared #5969

Open HammadB opened 2 months ago

HammadB commented 2 months ago

Describe the bug

The implementation of `get_array_memory_size` is incorrect according to its documentation, which states that it "Returns the total number of bytes of memory occupied physically by this batch." If the underlying buffers are shared within the record batch, this function overreports the size. This can happen, for example, if you write to the Arrow IPC format and then read the data back, because all of the data is then contiguous in one buffer.

https://docs.rs/arrow-array/52.0.0/src/arrow_array/record_batch.rs.html#472

To Reproduce

  1. Create a record batch
  2. Write to Arrow IPC
  3. Load it and call `get_array_memory_size` -> the reported size can be off by potentially many multiples

Expected behavior

I'd expect the sizing to be the actual total size across the unique buffers in the record batch.
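
To illustrate the difference between the two ways of counting, here is a minimal std-only sketch. It does not use the arrow-rs API: `BufferView`, `naive_memory_size`, and `unique_memory_size` are hypothetical names made up for this example, and an `Arc<Vec<u8>>` stands in for a shared Arrow buffer. The assumption is that each column holds a view into one shared allocation, as happens after an IPC read.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow buffer: a view (offset + len)
/// into an allocation that may be shared with other views.
struct BufferView {
    allocation: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl BufferView {
    /// Bytes this view actually covers.
    fn as_slice(&self) -> &[u8] {
        &self.allocation[self.offset..self.offset + self.len]
    }

    /// Size of the whole backing allocation, not just the viewed range.
    fn allocation_size(&self) -> usize {
        self.allocation.len()
    }
}

/// Naive accounting: every view is charged for the full allocation,
/// so a shared allocation is counted once per view.
fn naive_memory_size(columns: &[BufferView]) -> usize {
    columns.iter().map(|c| c.allocation_size()).sum()
}

/// Deduplicated accounting: each allocation is counted once,
/// identified by the address of the shared allocation.
fn unique_memory_size(columns: &[BufferView]) -> usize {
    let mut seen = HashSet::new();
    columns
        .iter()
        .filter(|c| seen.insert(Arc::as_ptr(&c.allocation)))
        .map(|c| c.allocation_size())
        .sum()
}
```

With four columns that each view 256 bytes of the same 1024-byte allocation, the naive sum charges the allocation four times, while the deduplicated sum charges it once. Whether pointer identity is the right notion of "unique" for real Arrow buffers is an assumption of this sketch.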

Additional context

tustvold commented 2 months ago

There is https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size that might be what you're looking for? If you create a RecordBatch from some portion of an IPC file buffer, it is unclear what the correct value for this API to return would be.

We should probably better document that this is only ever going to be a best-effort approximation, and that people should manage allocations themselves if they need accurate accounting.