apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Interoperability between arrow-rs and nanoarrow #5052

Open evgenyx00 opened 11 months ago

evgenyx00 commented 11 months ago

Which part is this question about

Deserialization from arrow-rs into nanoarrow

Describe your question

I've encountered a problem when serializing a basic Arrow object (a single RecordBatch) with StreamWriter and then deserializing it with nanoarrow. Deserializing the RecordBatch fails because of the header alignment verification in flatcc: https://github.com/apache/arrow-nanoarrow/blob/d104e9065101401c63e931acdc7c10f114c47eaf/dist/flatcc.c#L2453

The alignment failure occurs in the calculation of the base offset and the offset of the union value relative to the base. I'm not sure whether the problem lies in arrow-rs or in flatbuffers.

Additional context

Tested arrow-rs versions 5.0.0 and 47.0.0; both fail, so this is not a regression: it apparently never worked.

Steps to reproduce:

  1. Create arrow object and save only RecordBatch bytes

  2. Test using nanoarrow or example code https://github.com/apache/arrow-nanoarrow/blob/d104e9065101401c63e931acdc7c10f114c47eaf/examples/cmake-ipc/src/app.c

  3. Reproduced on Debian 11 x86 and MacOS M1

  4. Code snippet

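```rust
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer;
use arrow::record_batch::RecordBatch;

// Serializes a one-column, one-row RecordBatch to the Arrow IPC stream format.
fn get_arrow_bytes() -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::new();
    {
        let schema = Schema::new(vec![
            Field::new("AAAAAAAA", DataType::Utf8, true)
        ]);

        // The writer borrows `buf` until the end of this block.
        let mut arrow_writer = writer::StreamWriter::try_new(&mut buf, &schema).unwrap();

        let id_array = StringArray::from(vec!["BBBBBBBB".to_string()]);
        let batch = RecordBatch::try_new(
            Arc::new(schema),
            vec![Arc::new(id_array)]
        ).unwrap();

        arrow_writer.write(&batch).unwrap();
        arrow_writer.finish().unwrap();
    }
    buf
}
```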

tustvold commented 11 months ago

My memory is admittedly a little hazy, but I definitely remember that flatbuffers do not mandate any alignment internally. I am therefore not sure why flatcc would be including this in its verification process... Is nanoarrow using it in some especially pedantic mode or something?

Edit: reviewing the linked example code it does not appear to be doing anything to guarantee the alignment of the buffer the data is being read into - https://github.com/apache/arrow-nanoarrow/blob/d104e9065101401c63e931acdc7c10f114c47eaf/examples/cmake-ipc/src/app.c#L26. Does this work with data written by other systems? If so could you perhaps provide an example of an IPC file that works and one containing the same data that doesn't?

evgenyx00 commented 11 months ago

recordbatch.tgz

Attaching a tgz with two binary files: recordbatch-good.bin (working) and recordbatch-error.bin (not working).

I'm also enclosing the byte arrays used to generate these binaries, each containing a RecordBatch (the working one was not generated by Rust).

  1. Working let buf = [ 0xff, 0xff, 0xff, 0xff, 0x98, 0x00, 0x00, 0x00, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x16, 0x00, 0x0e, 0x00, 0x15, 0x00, 0x10, 0x00, 0x04, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x03, 0x0a, 0x00, 0x18, 0x00, 0x0c, 0x00, 0x08, 0x00, 0x04, 0x00, 0x0a, 0x00, 0x00, 0x00, 0x14, 0x00, 0x00, 0x00, 0x48, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, ];

  2. Non working buf

let buf = [ 0xff, 0xff, 0xff, 0xff, 0xb8, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x1a, 0x00, 0x18, 0x00, 0x17, 0x00, 0x04, 0x00, 0x08, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x04, 0x00, 0x0a, 0x00, 0x14, 0x00, 0x0c, 0x00, 0x08, 0x00, 0x04, 0x00, 0x0a, 0x00, 0x00, 0x00, 0x24, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, ];

evgenyx00 commented 11 months ago

Not sure if it helps: in the failing case the misalignment is 4 bytes. The check that fails is

!((k + offset_size) & ((offset_size - 1) | (align - 1u)));

where offset_size = 4, align = 8, and k = 72, i.e. base (60) + offset (12).
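To make the arithmetic concrete, here is a minimal Rust sketch of the expression quoted above. The function names are hypothetical and the code only mirrors the quoted expression; it is not flatcc's actual implementation. The contrasting value k = 132 appears to be where the nodes vector starts in the working buffer, counted from the start of the flatbuffer.

```rust
/// Returns the low bits that must be zero for the vector to pass the check.
/// `k` is the position of the vector's length prefix, `offset_size` the size
/// of that prefix, and `align` the alignment required by the elements.
fn misalignment(k: u32, offset_size: u32, align: u32) -> u32 {
    (k + offset_size) & ((offset_size - 1) | (align - 1))
}

/// True when the first element after the length prefix is properly aligned.
fn vector_is_aligned(k: u32, offset_size: u32, align: u32) -> bool {
    misalignment(k, offset_size, align) == 0
}

fn main() {
    // Failing case from the comment: k = base (60) + offset (12) = 72.
    // (72 + 4) & 7 = 76 & 7 = 4, so the check fails with 4 bytes of misalignment.
    println!("k = 72: aligned = {}, off by {}",
             vector_is_aligned(72, 4, 8), misalignment(72, 4, 8));

    // Contrasting case: (132 + 4) & 7 = 0, so the check passes.
    println!("k = 132: aligned = {}", vector_is_aligned(132, 4, 8));
}
```

The elements start right after the 4-byte length prefix, so it is `k + offset_size`, not `k` itself, that has to land on an 8-byte boundary.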

tustvold commented 11 months ago

The provided non-working buf doesn't even read with arrow-rs, what code did you use to produce it?

Edit: In fact, neither file is a valid IPC stream AFAICT...

evgenyx00 commented 11 months ago

Apologies for not being clear: the provided working and non-working samples contain only a RecordBatch, without the preceding schema, so that they can be tested with the nanoarrow example (app.c). The example app doesn't iterate over all headers.

The code used to reproduce is at the beginning of the thread.

I'm also enclosing full working and non-working samples: arrow-samples.tgz
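For readers unfamiliar with the framing, each sample above is a single encapsulated IPC message followed by an end-of-stream marker, which is why it does not read as a complete stream. The sketch below decodes only the 8-byte prefix (continuation marker plus little-endian metadata length); the `Prefix` type and `parse_prefix` function are hypothetical names, not part of arrow-rs or nanoarrow, and anything without the continuation marker is simply rejected.

```rust
#[derive(Debug, PartialEq)]
enum Prefix {
    /// A message whose flatbuffer metadata is `metadata_len` bytes long.
    Message { metadata_len: usize },
    /// Continuation marker followed by a zero length.
    EndOfStream,
}

/// Decodes the 8-byte prefix of an encapsulated IPC message.
fn parse_prefix(bytes: &[u8]) -> Option<Prefix> {
    if bytes.len() < 8 || bytes[..4].iter().any(|&b| b != 0xff) {
        return None; // too short, or no 0xFFFFFFFF continuation marker
    }
    let len = i32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    if len == 0 {
        Some(Prefix::EndOfStream)
    } else if len > 0 {
        Some(Prefix::Message { metadata_len: len as usize })
    } else {
        None // negative lengths are not meaningful
    }
}

fn main() {
    // First eight bytes of the working and non-working buffers posted above.
    let good = [0xff, 0xff, 0xff, 0xff, 0x98, 0x00, 0x00, 0x00];
    let bad = [0xff, 0xff, 0xff, 0xff, 0xb8, 0x00, 0x00, 0x00];
    // Last eight bytes of both buffers.
    let tail = [0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00];

    println!("{:?}", parse_prefix(&good)); // Some(Message { metadata_len: 152 })
    println!("{:?}", parse_prefix(&bad)); // Some(Message { metadata_len: 184 })
    println!("{:?}", parse_prefix(&tail)); // Some(EndOfStream)
}
```

The two buffers carry the same 24-byte body, so the 32-byte difference in metadata length comes entirely from how the flatbuffer was laid out.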

tustvold commented 10 months ago

Annotating the relevant RecordBatch we get

Good

0x14, 0x00, 0x00, 0x00, // Message offset (20)
0x00, 0x00, 0x00, 0x00, // Padding
// Message VTable
0x0c, 0x00, // VTable length (12)
0x16, 0x00, // Object size (22)
0x0e, 0x00, // Field 0 offset (version) (14)
0x15, 0x00, // Field 1 offset (header type) (21)
0x10, 0x00, // Field 2 offset (header offset) (16)
0x04, 0x00, // Field 3 offset (bodyLength) (4)
// Message table
0x0c, 0x00, 0x00, 0x00, // SOffset to VTable (12)
0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // (bodyLength) (24), a 64-bit value
0x00, 0x00, // Padding
0x04, 0x00, // (version)
0x10, 0x00, 0x00, 0x00, // Offset to RecordBatch (16)
0x00, 0x03, // (header type)  (RecordBatch)
// End of Message

// RecordBatch VTable
0x0a, 0x00, // VTable length (10)
0x18, 0x00, // Object size (24)
0x0c, 0x00, // Field 0 offset (length) (12)
0x08, 0x00, // Field 1 offset (nodes) (8)
0x04, 0x00, // Field 2 offset (buffers) (4)
// RecordBatch table
0x0a, 0x00, 0x00, 0x00, // SOffset to VTable (10)
0x14, 0x00, 0x00, 0x00, // Offset to buffers (20)
0x48, 0x00, 0x00, 0x00, // Offset to nodes (72)
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Length
0x00, 0x00, 0x00, 0x00, // Padding
// Buffers
0x03, 0x00, 0x00, 0x00, // Vector Length 3
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Offset 0
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Length 1
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Offset 8
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Length 8
0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Offset 16
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Length 8
// Nodes
0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, // Vector Length 1
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode Length 1
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode Null Count 0

Bad

0x10, 0x00, 0x00, 0x00, // Message offset (16)
// Message VTable
0x0c, 0x00, // VTable length (12)
0x1a, 0x00, // Object size (26)
0x18, 0x00, // Field 0 offset (version) (24)
0x17, 0x00, // Field 1 offset (header type) (23)
0x04, 0x00, // Field 2 offset (header offset) (4)
0x08, 0x00, // Field 3 offset (bodyLength) (8)
// Message table
0x0c, 0x00, 0x00, 0x00, // SOffset to VTable (12)
0x20, 0x00, 0x00, 0x00, // Offset to header (32)
0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // (bodyLength) (24), a 64-bit value
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Padding
0x03, // (header type) (RecordBatch)
0x04, 0x00, // (version)
// End of Message

// RecordBatch VTable
0x0a, 0x00, // VTable length (10)
0x14, 0x00, // Object Size (20)
0x0c, 0x00, // Field 0 offset (length) (12)
0x08, 0x00, // Field 1 offset (nodes) (8)
0x04, 0x00, // Field 2 offset (buffers) (4)
// RecordBatch table
0x0a, 0x00, 0x00, 0x00, // SOffset to VTable (10)
0x24, 0x00, 0x00, 0x00, // Offset to buffers
0x0c, 0x00, 0x00, 0x00, // Offset to nodes
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Length
// Nodes
0x01, 0x00, 0x00, 0x00, // Vector length 1
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode length 1
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode null count 0
// Buffers
0x03, 0x00, 0x00, 0x00, // Vector length 3
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer offset 0
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer length 1
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer offset 8
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer length 8
0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer offset 16
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer length 8
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,

In the bad example we can see that the vectors don't contain padding to align them to an 8 byte boundary, instead they only have 4 byte alignment. This in turn means that the structs are not correctly aligned, which I suspect is what flatcc is then complaining about.
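The position of the misaligned vector can be reproduced by following the offsets in the bad example by hand. The sketch below does that over the first 76 bytes of the non-working flatbuffer (the 8-byte IPC prefix is left out). The helper names are hypothetical, there is no bounds or validity checking, and the field indices (header = 2 in Message, nodes = 1 in RecordBatch) are taken from the annotations above.

```rust
// First 76 bytes of the failing flatbuffer, positions counted from its start.
const BAD_METADATA: &[u8] = &[
    0x10, 0x00, 0x00, 0x00,                         //  0: root offset (16)
    0x0c, 0x00, 0x1a, 0x00, 0x18, 0x00, 0x17, 0x00, //  4: Message vtable
    0x04, 0x00, 0x08, 0x00,                         // 12: slots for header, bodyLength
    0x0c, 0x00, 0x00, 0x00,                         // 16: Message table, soffset to vtable
    0x20, 0x00, 0x00, 0x00,                         // 20: offset to header (32)
    0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // 24: bodyLength (24)
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,       // 32: padding
    0x03,                                           // 39: header type (RecordBatch)
    0x04, 0x00,                                     // 40: version
    0x0a, 0x00, 0x14, 0x00, 0x0c, 0x00, 0x08, 0x00, // 42: RecordBatch vtable
    0x04, 0x00,                                     // 50: slot for buffers
    0x0a, 0x00, 0x00, 0x00,                         // 52: RecordBatch table, soffset to vtable
    0x24, 0x00, 0x00, 0x00,                         // 56: offset to buffers (36)
    0x0c, 0x00, 0x00, 0x00,                         // 60: offset to nodes (12)
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // 64: length (1)
    0x01, 0x00, 0x00, 0x00,                         // 72: nodes vector length (1)
];

fn read_u16(buf: &[u8], pos: usize) -> u16 {
    u16::from_le_bytes([buf[pos], buf[pos + 1]])
}

fn read_u32(buf: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]])
}

/// Follows an unsigned offset stored at `pos`; the target is `pos + offset`.
fn follow(buf: &[u8], pos: usize) -> usize {
    pos + read_u32(buf, pos) as usize
}

/// Position of field number `field` inside the table at `table`, if present.
fn field_pos(buf: &[u8], table: usize, field: usize) -> Option<usize> {
    // The table starts with a signed offset that is subtracted to reach its vtable.
    let soffset = read_u32(buf, table) as i32;
    let vtable = (table as i64 - soffset as i64) as usize;
    let vtable_len = read_u16(buf, vtable) as usize;
    let slot = 4 + 2 * field; // skip vtable length and object size
    if slot + 2 > vtable_len {
        return None;
    }
    match read_u16(buf, vtable + slot) as usize {
        0 => None, // field absent
        off => Some(table + off),
    }
}

/// Position of the length prefix of RecordBatch.nodes.
fn nodes_vector_pos(buf: &[u8]) -> Option<usize> {
    let message = follow(buf, 0);
    let record_batch = follow(buf, field_pos(buf, message, 2)?);
    Some(follow(buf, field_pos(buf, record_batch, 1)?))
}

fn main() {
    let k = nodes_vector_pos(BAD_METADATA).expect("header and nodes are present");
    // Prints: nodes vector at 72, first FieldNode at 76, off by 4
    println!("nodes vector at {}, first FieldNode at {}, off by {}", k, k + 4, (k + 4) % 8);
}
```

The first FieldNode lands at position 76, which is a multiple of 4 but not of 8, even though FieldNode consists of two 64-bit integers.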

This appears to be an upstream bug

Edit: Filed https://github.com/google/flatbuffers/issues/8150

evgenyx00 commented 10 months ago

@tustvold appreciate your prompt assistance :)

alamb commented 1 week ago

@bkietz mentioned on https://github.com/apache/arrow-rs/pull/6449#issuecomment-2374556893:

FWIW, #6426 and google/flatbuffers#8398 should fix #5052

alamb commented 1 week ago

We have disabled the CI test in https://github.com/apache/arrow-rs/pull/6449, so as part of closing this issue we should re-enable the test.