Open alamb opened 1 year ago
In reviewing the arrow IPC writer code, it does appear to be clever about using offsets when actually writing (thanks @viirya in https://github.com/apache/arrow-rs/pull/2040 ❤️ ) https://github.com/apache/arrow-rs/blob/acefeef1cb5698a6afe1d3061644f6276d39117c/arrow-ipc/src/writer.rs#L1094-L1260
However, I am not sure exactly how this will translate to flight data size -- I am writing some more tests now
PR with tests showing how far from optimal the current splitting logic is: https://github.com/apache/arrow-rs/pull/3481
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Some implementations of gRPC, such as golang, have a default max message size that is "relatively small" (4MB), and the clients will generate errors if they receive larger messages.
The `FlightDataEncoder` has a mechanism (link) to try and avoid this problem by heuristically slicing `RecordBatch`es into smaller parts to limit their size. This works well for primitive arrays but does not work well for other cases, as we have found upstream in IOx. Lists, structs, and other nested types probably suffer from similar issues with maximum message sizes.
Of course, the smallest message possible is a single row, which can still be significantly larger than the `max_flight_data_size` limit for variable length columns (e.g. several large string columns).
Describe the solution you'd like
I would like to improve the situation: handle nested types and more effectively reduce the `FlightDataSize`.
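To make the problem concrete, here is a minimal sketch in plain Rust (no arrow types) that models a batch as a list of per-row encoded sizes. Everything in it is hypothetical and for illustration only: `even_split` is a simplified model of a "divide rows evenly based on total size" heuristic, not the actual encoder code, and `size_aware_split` is one possible greedy alternative, not a proposed implementation.

```rust
use std::ops::Range;

/// Simplified model of an even split: pick the number of chunks from the
/// total size, then give every chunk the same number of rows.
/// `row_sizes[i]` is the (hypothetical) encoded size of row `i` in bytes.
fn even_split(row_sizes: &[usize], max_size: usize) -> Vec<Range<usize>> {
    assert!(max_size > 0, "max_size must be non-zero");
    let total: usize = row_sizes.iter().sum();
    // Chunks needed if every row had the average size (ceiling division).
    let n_chunks = ((total + max_size - 1) / max_size).max(1);
    let rows_per_chunk = (row_sizes.len() / n_chunks).max(1);

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < row_sizes.len() {
        let end = (start + rows_per_chunk).min(row_sizes.len());
        ranges.push(start..end);
        start = end;
    }
    ranges
}

/// Greedy alternative: close the current chunk as soon as adding the next
/// row would exceed `max_size`. A chunk always holds at least one row, so a
/// single oversized row still produces an oversized chunk.
fn size_aware_split(row_sizes: &[usize], max_size: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    let mut current = 0usize;
    for (i, &size) in row_sizes.iter().enumerate() {
        if i > start && current + size > max_size {
            ranges.push(start..i);
            start = i;
            current = 0;
        }
        current += size;
    }
    if start < row_sizes.len() {
        ranges.push(start..row_sizes.len());
    }
    ranges
}

/// Total size in bytes of each chunk described by `ranges`.
fn chunk_sizes(row_sizes: &[usize], ranges: &[Range<usize>]) -> Vec<usize> {
    ranges
        .iter()
        .map(|r| row_sizes[r.clone()].iter().sum::<usize>())
        .collect()
}
```

With row sizes `[5, 5, 5, 5, 40, 40]` and a limit of 50, the even split yields chunks of 15 and 85 bytes (the second is over the limit), while the greedy split yields 20, 40 and 40. With row sizes `[10, 70]` and the same limit, even the greedy split yields 10 and 70: a single row larger than the limit cannot be reduced by slicing rows alone.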
Describe alternatives you've considered
Additional context
See #3347