apache / arrow


[C#][C++] read_feather() fails with IOError: Invalid IPC message: negative bodyLength on batches larger than 2 GB #37069

Open twest820 opened 1 year ago

twest820 commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I have some C# code using the Apache.Arrow 12.0.1 NuGet package to write .feather files with ArrowFileWriter.WriteRecordBatch() and its companions. These files are then read in R 4.2 using read_feather() from the arrow 12.0.0 package (Windows 10 22H2, RStudio 2023.06.1). This process works fine for files with a single record batch up to at least 2,129,214,698 bytes (1.98 GB). Much above that (the next file size up that my code happens to produce is 2,176,530,466 bytes, 2.02 GB), read_feather() fails with

Error: IOError: Invalid IPC message: negative bodyLength

This appears to be coming from CheckMetadataAndGetBodyLength(), which takes bodyLength as an int64_t. So presumably something upstream of CheckMetadataAndGetBodyLength() is handling bodyLength as a signed 32-bit value and inducing integer rollover.

The error message is a bit cryptic and, from a brief look at the code, it's unclear how a block's bodyLength relates to file and record batch size. Since forcing the total (uncompressed) size of a record batch below 2 GB avoids the negative bodyLength error, it appears CheckMetadataAndGetBodyLength() is picking up something controlled by record batch size rather than file size. At the moment I've tested multi-batch files up to 3.6 GB and read_feather() handles them fine; a sketch of that workaround is below.
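For illustration, the multi-batch write path looks roughly like the following. This is a minimal sketch rather than my actual code: the schema, column names, row counts, and batch size are placeholders, and the only point is that each call to WriteRecordBatch() gets a batch whose total buffer size stays under 2 GB.

```csharp
using System;
using System.IO;
using System.Linq;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Types;

// Placeholder schema: a single float column standing in for the real 11+ columns.
Schema schema = new Schema.Builder()
    .Field(f => f.Name("value").DataType(FloatType.Default).Nullable(false))
    .Build();

const int totalRows = 44_000_000;    // placeholder row count
const int rowsPerBatch = 10_000_000; // chosen so each batch body stays well under 2 GB

using FileStream stream = File.Create("example.feather");
using ArrowFileWriter writer = new ArrowFileWriter(stream, schema);

for (int start = 0; start < totalRows; start += rowsPerBatch)
{
    int count = Math.Min(rowsPerBatch, totalRows - start);

    // Build this batch's column data (placeholder values).
    FloatArray column = new FloatArray.Builder()
        .AppendRange(Enumerable.Range(start, count).Select(i => (float)i))
        .Build();

    RecordBatch batch = new RecordBatch(schema, new IArrowArray[] { column }, count);
    writer.WriteRecordBatch(batch);
}

writer.WriteEnd();
```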

The Arrow spec seems to suggest compliant Arrow implementations support array lengths up to 2³¹ - 1 elements. The record batch format here is 11 four-byte columns plus one one-byte and a couple of two-byte columns, so one reading of the spec is that files (or maybe record batches) up to 98 GB should be compatible with the recommended limits on multi-language use.
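For what it's worth, a back-of-envelope version of that estimate (assuming I've counted the row layout right): 11 × 4 + 1 + 2 × 2 = 49 bytes per row, and 49 bytes × (2³¹ - 1) rows ≈ 98 GiB.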

I'm including the C++ tag on this as I'm not seeing where C# or R would be inducing rollover, suggesting it may be a lower-level issue or perhaps at the managed-unmanaged boundary. ArrowBuffer is built on ReadOnlyMemory<byte> and is therefore constrained by its int32 Length property. Not sure if that's considered to meet the spec, but it doesn't appear to be causing the rollover here (the 2.02 GB file is only 44M records, so not more than 170 MB per column even when everything's in one RecordBatch), and the unchecked int32 conversion in ArrowMemoryReaderImplementation.ReadNextRecordBatch() probably isn't involved either. Since I'm not getting an exception at write time, the checked C# conversions from long can be excluded.

Component(s)

C#, C++, R

thisisnic commented 1 year ago

I had a look at the C++ code, but I'm a bit unsure what's happening here. @pitrou - would you mind taking a look at this?

pitrou commented 1 year ago

The C++ side seems ok here. flatbuf::Message::bodyLength returns an int64 and CheckMetadataAndGetBodyLength stores it in an int64 before comparing to 0. There is no truncation.

So it seems it's probably an issue on the C# side?

pitrou commented 1 year ago

I think the problem is here. TotalLength sums the buffer lengths for an entire record batch, but it's a 32-bit integer. WriteRecordBatchInternal then writes it as the message's body length.
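To illustrate the failure mode (this is just the arithmetic, not the actual Apache.Arrow source): accumulating more than 2 GB of buffer lengths into a 32-bit int wraps negative in C#'s default unchecked context, and a wrapped value written as the Flatbuffer bodyLength is exactly what produces the "Invalid IPC message: negative bodyLength" error on the read side.

```csharp
using System;

// Illustration only: buffer sizes that together exceed int.MaxValue
// wrap around when accumulated in a 32-bit int.
long[] bufferLengths = { 1_500_000_000L, 800_000_000L };  // 2.3 GB total

int totalLength = 0;                 // 32-bit accumulator, like TotalLength
foreach (long length in bufferLengths)
{
    totalLength += (int)length;      // unchecked by default, so this wraps silently
}

Console.WriteLine(totalLength);      // prints -1994967296, a negative body length
```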

I think for sanity and robustness the C# implementation should strive to use long for all lengths and sizes, instead of an ad hoc mix of ints and longs.

@CurtHagenlocher @eerhardt @chutchinson

voidstar69 commented 11 months ago

@pitrou does this draft PR #38593 go far enough to support 2GB+ record batches, or would you expect a lot more changes, e.g. many more lengths changed from int to long?

pitrou commented 11 months ago

@voidstar69 I have no idea; I think the only reliable way to know would be to test reading such a file?

That said, I think all columnar lengths (numbers of rows and numbers of bytes) should be made long anyway. Otherwise, other bugs are just waiting to happen.

twest820 commented 11 months ago

Agree with @pitrou on testing and migrating to 64-bit sizes and offsets. From a look at the PR diffs I'd anticipate trouble at offset and paddedLength since those remain 32 bit; a signed 32-bit byte count on a >2 GB file is most likely a defect (barring relative offsets used in special cases).

Since C#'s maximum object length is Int32.MaxValue, code paths using Flatbuf.RecordBatch can't exceed 2 GB as it's 1:1 with ByteBuffer. If Flatbuf use isn't congruent with IEnumerable<ArrowArray> on the public RecordBatch, that might be another area to look for trouble.

CurtHagenlocher commented 10 months ago

I did a little poking at this over the weekend, and it looks like the C# implementation can trivially be made to write large record batches correctly. In addition to the changes in #38593, I think it's sufficient to change the "Block" class to contain longs instead of ints.

Reading large record batches is another matter entirely. The code in ArrowStreamReaderImplementation expects to be able to create a single memory buffer to hold the entire record batch, and this is not currently possible for a record batch whose size is > 2 GB. Of course, it's hard to test that the write of a >2GB record batch worked correctly if you can't also read it...

I think in the near term the best bet is to make write work correctly (testing it manually by loading into another implementation) and to make sure that reads of large record batches produce reasonable error messages.