apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.47k stars 738 forks source link

IPC code writes files which do not include a valid stream #6311

Open bkietz opened 3 weeks ago

bkietz commented 3 weeks ago

Describe the bug

The format guarantees that each IPC file embeds a valid IPC stream in order to allow readers to ignore the Footer, skip the file's leading magic, and reuse a stream reader.

However, when writing IPC files arrow-rs aligns the encapsulated flatbuffers Messages to 64 byte boundaries instead of 8 bytes. This can leave gaps of padding bytes between the Messages which a stream reader would not know to skip.

To Reproduce

https://github.com/apache/arrow/pull/43834 adds validation of the embedded stream to arrow-c++'s arrow-json-integration-test. Running this against IPC files written by arrow-rs raises an error

$ arrow-json-integration-test -arrow datetime.arrow -json datetime.json -integration -mode VALIDATE
Error message: Invalid: Tried reading schema message, was null or length 0

Expected behavior

A stream reader should be able to read an IPC file by skipping the first 8 bytes.

Additional context

This was originally introduced in https://github.com/apache/arrow-rs/commit/eddef43d1cb46c1287da187ea1d86b0e1dc35a13 which added alignment to address new requirements around i128. However the alignment should not be applied to flatbuffers Messages; apart from the above issue I think there's no SIMD or other advantage to aligning those to more than 8 bytes. Body buffers can of course still be padded and aligned freely.

This was discovered while adding IPC file reading to nanoarrow; we were trying to defer reading Footers for a follow up and discovered that the go, rust, and javascript implementations don't embed a valid stream. Most readers have not noticed because offsets and schemas are more efficiently read from a Footer, and once acquired obviate sequential stream-style reading.

bkietz commented 3 weeks ago

I tried adding validation to arrow-rs as with https://github.com/apache/arrow/pull/43834 with validate-stream.zip

bkietz commented 3 weeks ago

@westonpace

westonpace commented 3 weeks ago

Ben and I spoke about this a bit externally. As I understand it the main benefit is that it makes it simpler to consume the IPC format (you can consume both file and streaming formats with a single streaming implementation if you don't need the random access).

I don't believe there are any drawbacks to restricting the alignment of the bytes following the magic header. I also don't envision any backwards compatibility concerns (readers that are using the flatbuffers footers can continue to accept whatever header alignment the producer is creating, the change is just to stop writing certain alignments so that readers that don't use the footer can start working)