apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Reduce memory usage when writing to IPC #31528

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) memory, where N is the average size of a buffer and B is the number of buffers in the record batch.

This is because writing the record's header requires each buffer's location and the total number of bytes, which are only known after, e.g., knowing by how much the buffers were compressed.

When the writer supports seeking, this memory usage can be reduced to O(N) where N is the average size of a primitive buffer over all fields. This is done using the following pseudo-code implementation:

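start = writer.seek(current)               # remember where the header begins
empty_locations = create_empty_header(schema)
write_header(writer, empty_locations)      # placeholder with the final header size
locations = write_buffers(writer, batch)   # written one buffer at a time
end = writer.seek(current)                 # remember where the buffers end
writer.seek(start)
write_header(writer, locations)            # overwrite the placeholder
writer.seek(end)                           # resume after the buffers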

This has a significantly lower memory footprint: O(N) vs. O(N*B).
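As a rough illustration of the pseudo-code above, here is a minimal Python sketch. It is not the Arrow IPC format: the header layout (a buffer count followed by one offset/length pair per buffer), the use of zlib, and the function names write_batch_seekable and read_batch are all assumptions made for the example. The point is only that each buffer is compressed and written on its own, so at most one compressed buffer is held in memory at a time.

```python
import io
import struct
import zlib


def write_batch_seekable(sink, buffers):
    """Write [header][buffers] to a seekable sink, one buffer at a time.

    Hypothetical header layout (not the Arrow IPC format):
    uint32 buffer count, then (uint64 offset, uint64 length) per buffer,
    with offsets relative to the start of the header.
    """
    header_fmt = "<I" + "QQ" * len(buffers)
    header_size = struct.calcsize(header_fmt)

    start = sink.tell()
    sink.write(b"\x00" * header_size)  # placeholder header

    locations = []
    for buf in buffers:
        compressed = zlib.compress(buf)  # only this buffer is held in memory
        offset = sink.tell() - start
        sink.write(compressed)
        locations.append((offset, len(compressed)))

    end = sink.tell()
    sink.seek(start)  # go back and patch the real header in place
    flat = [value for loc in locations for value in loc]
    sink.write(struct.pack(header_fmt, len(buffers), *flat))
    sink.seek(end)  # leave the sink positioned after the buffers
    return locations


def read_batch(source, start=0):
    """Read back the buffers written by write_batch_seekable."""
    source.seek(start)
    (count,) = struct.unpack("<I", source.read(4))
    fields = struct.unpack("<" + "QQ" * count, source.read(16 * count))
    buffers = []
    for i in range(count):
        offset, length = fields[2 * i], fields[2 * i + 1]
        source.seek(start + offset)
        buffers.append(zlib.decompress(source.read(length)))
    return buffers
```

The placeholder has to be exactly as large as the final header, which is why the sketch uses fixed-width fields; a variable-length header encoding would need padding or a reserved maximum size.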

It could be interesting for the C++ implementation to support this.

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-16118. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: One note is that this will only be possible when using a local filesystem. Object stores usually do not support modifying files after they have been uploaded. So we might need to add a flag to the filesystem to ask whether it supports this, but that could be useful anyway. On spinning-disk filesystems (i.e. HDD), the seek penalty may reduce the benefit (or even make it negative) if there is not much RAM pressure.

However, I could see this being useful in cases where we are writing to a very fast solid state drive.
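To make the "ask whether it supports this" idea concrete, here is a small Python sketch of choosing a write strategy from the sink's capabilities. It relies on the standard io seekable() method rather than any Arrow filesystem flag; choose_write_strategy and AppendOnlySink are hypothetical names, and AppendOnlySink is only a stand-in for an object-store upload stream.

```python
import io


def choose_write_strategy(sink):
    """Return "patch-header" if the sink can seek back, else "buffer-all"."""
    seekable = getattr(sink, "seekable", None)
    if callable(seekable) and seekable():
        return "patch-header"  # write placeholder header, patch it afterwards
    return "buffer-all"  # fall back to building the whole batch in memory


class AppendOnlySink(io.RawIOBase):
    """Hypothetical stand-in for an object-store upload stream."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def seekable(self):
        return False  # bytes cannot be modified once written

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)
```

With this check, io.BytesIO() or a local file would take the header-patching path, while an append-only sink would keep the current behaviour.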

Another memory-usage issue that might be related, since we are talking about writes, is ARROW-14635. I'd like to move to direct I/O (or at least be smarter about the OS page cache) when writing large datasets, since we otherwise end up causing the system to swap for no good reason.

asfimport commented 2 years ago

Micah Kornfield / @emkornfield: Also, we should be careful about how this is enabled: if someone is actually consuming the stream in real time, there would need to be some sort of coordination to ensure bytes aren't sent prematurely.