Open asfimport opened 2 years ago
Weston Pace / @westonpace: One note is that this will only be possible when using a local filesystem. Object stores usually do not support modifying files after they have been uploaded. So we might need to add a flag to the filesystem to ask whether it supports this, but that could be useful. On spinning-disk filesystems (i.e. HDD) the seek penalty may reduce the benefit (or even make it negative) if there is not much RAM pressure.
However, I could see this being useful in cases where we are writing to a very fast solid state drive.
Another memory-usage issue that might be related since we are talking about writes is ARROW-14635. I'd like to move to direct I/O (or at least being smarter about OS page cache) when writing large datasets since we otherwise end up causing the system to swap for no good reason.
Micah Kornfield / @emkornfield: Also, we should be careful how this is enabled, since if someone is actually consuming the stream in real time there would need to be some sort of coordination to ensure bytes aren't sent prematurely.
Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) memory, where N is the average size of a buffer and B is the number of buffers in the record batch.
This is because the header needs each buffer's location and the total number of bytes, which are only known after, e.g., knowing by how much the buffers were compressed.
When the writer supports seeking, this memory usage can be reduced to O(N) where N is the average size of a primitive buffer over all fields. This is done using the following pseudo-code implementation:
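As a rough illustration of the idea, here is a minimal Python sketch. It is not the Arrow IPC format: the header layout (a buffer count followed by one offset/length pair per buffer), the use of `zlib`, and the names `write_batch_seeking` and `read_batch` are all assumptions made for the example. The writer reserves space for the header, compresses and writes one buffer at a time, then seeks back to fill the header in, so only one compressed buffer is held in memory at any point.

```python
import io
import struct
import zlib

# Hypothetical header layout (NOT the Arrow IPC format):
# a u32 buffer count, then one (offset, length) pair of u64 per buffer.
COUNT = struct.Struct("<I")
ENTRY = struct.Struct("<QQ")


def write_batch_seeking(sink, buffers):
    """Write [header][compressed buffers] to a seekable sink.

    Only one compressed buffer is alive at a time; the header is
    filled in afterwards by seeking back to the reserved space.
    """
    start = sink.tell()
    header_size = COUNT.size + ENTRY.size * len(buffers)
    sink.write(b"\x00" * header_size)  # reserve space for the header

    entries = []
    for buf in buffers:
        compressed = zlib.compress(buf)  # O(N) temporary memory
        entries.append((sink.tell() - start, len(compressed)))
        sink.write(compressed)  # the compressed copy can be freed after this

    end = sink.tell()
    sink.seek(start)  # go back and write the real header
    sink.write(COUNT.pack(len(buffers)))
    for offset, length in entries:
        sink.write(ENTRY.pack(offset, length))
    sink.seek(end)  # leave the sink positioned after the batch
    return end - start


def read_batch(source, start=0):
    """Read back a batch written by write_batch_seeking."""
    source.seek(start)
    (count,) = COUNT.unpack(source.read(COUNT.size))
    entries = [ENTRY.unpack(source.read(ENTRY.size)) for _ in range(count)]
    buffers = []
    for offset, length in entries:
        source.seek(start + offset)
        buffers.append(zlib.decompress(source.read(length)))
    return buffers
```

The sketch relies on the header size being known up front (here it depends only on the number of buffers), which is what allows the space to be reserved before any buffer is compressed.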
This has a significantly lower memory footprint: O(N) vs. O(N*B).
It could be interesting for the C++ implementation to support this.
Reporter: Jorge Leitão / @jorgecarleitao
Note: This issue was originally created as ARROW-16118. Please see the migration documentation for further details.