Open twest820 opened 1 year ago
I had a look at the C++ code, but I'm a bit unsure what's happening here. @pitrou - would you mind taking a look at this?
The C++ side seems ok here. `flatbuf::Message::bodyLength` returns an int64 and `CheckMetadataAndGetBodyLength` stores it in an int64 before comparing to 0. There is no truncation.
So it seems it's probably an issue on the C# side?
I think the problem is here. `TotalLength` sums the buffer lengths for an entire record batch, but it's a 32-bit integer. `WriteRecordBatchInternal` then writes it as the message's body length.
I think for sanity and robustness the C# implementation should strive to use `long` for all lengths and sizes, instead of an ad hoc mix of `int`s and `long`s.
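As a minimal sketch of the failure mode (not the actual Apache.Arrow code, and the buffer sizes are made up), summing per-buffer byte counts into a 32-bit accumulator rolls over once a batch crosses 2 GB, while a 64-bit accumulator reports the real size:

```csharp
using System;

// Illustrative per-buffer byte counts for one large record batch
// (made-up values; not taken from the real writer).
long[] bufferLengths = { 1_200_000_000, 976_530_466 };

int totalAsInt = 0;      // mirrors a 32-bit accumulator like TotalLength
long totalAsLong = 0;    // what a 64-bit accumulator would report

foreach (long length in bufferLengths)
{
    totalAsInt = unchecked(totalAsInt + (int)length);
    totalAsLong += length;
}

Console.WriteLine(totalAsInt);   // -2118436830: rolled over past 2 GB
Console.WriteLine(totalAsLong);  // 2176530466: the actual body size
```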
@CurtHagenlocher @eerhardt @chutchinson
@pitrou does this draft PR #38593 go far enough to support 2GB+ record batches, or would you expect a lot more changes, e.g. many more lengths changed from `int` to `long`?
@voidstar69 I have no idea, I think the only reliable way to know would be to test reading such a file?
That said, I think all columnar lengths (numbers of rows and numbers of bytes) should be made `long` anyway. Otherwise, other bugs are just waiting to happen.
Agree with @pitrou on testing and migrating to 64 bit sizes and offsets. From a look at the PR diffs I'd anticipate trouble at `offset` and `paddedLength` since those remain 32 bit; a signed 32 bit byte count on a >2 GB file is most likely a defect (barring relative offsets used in special cases).
Since C#'s maximum object length is `Int32.MaxValue`, code paths using `Flatbuf.RecordBatch` can't exceed 2 GB as it's 1:1 with `ByteBuffer`. If Flatbuf use isn't congruent with `IEnumerable<ArrowArray>` on the public `RecordBatch`, that might be another area to look for trouble.
I did a little poking at this over the weekend, and it looks like the C# implementation can trivially be made to write large record batches correctly. In addition to the changes in #38593, I think it's sufficient to change the `Block` class to contain longs instead of ints.
Reading large record batches is another matter entirely. The code in `ArrowStreamReaderImplementation` expects to be able to create a single memory buffer to hold the entire record batch, and this is not currently possible for a record batch whose size is > 2 GB. Of course, it's hard to test that the write of a >2 GB record batch worked correctly if you can't also read it...
I think in the near term the best bet is to make write work correctly -- testing it manually by loading into another implementation -- and to make sure that reads of large record batches produce reasonable error messages.
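Something along these lines could make the read side fail understandably in the meantime; this is only a sketch, and `AllocateBody` is a hypothetical helper rather than anything in the current reader:

```csharp
using System;

static class LargeBatchGuard
{
    // Sketch only: validate a 64-bit body length before a single managed buffer
    // is allocated for it, so oversized batches fail with a clear message.
    public static byte[] AllocateBody(long bodyLength)
    {
        if (bodyLength < 0)
        {
            throw new InvalidOperationException(
                $"Negative body length {bodyLength}: the file metadata is corrupt " +
                "or was written with a 32-bit length that rolled over.");
        }

        if (bodyLength > int.MaxValue)
        {
            // A single byte[] / Memory<byte> cannot exceed int.MaxValue bytes.
            throw new NotSupportedException(
                $"Record batch body of {bodyLength} bytes exceeds the 2 GB " +
                "single-buffer limit; reading it is not currently supported.");
        }

        return new byte[(int)bodyLength];
    }
}
```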
Describe the bug, including details regarding any error messages, version, and platform.
I have some C# code using the Apache.Arrow 12.0.1 nuget to write .feather files with `ArrowFileWriter.WriteRecordBatch()` and its companions. These files are then read in R 4.2 using `read_feather()` from the arrow 12.0.0 package (Windows 10 22H2, RStudio 2023.06.1). This process works fine for files with a single record batch up to at least 2,129,214,698 bytes (1.98 GB). Much above that (the next file size up that my code produces happens to be 2,176,530,466 bytes, 2.02 GB), `read_feather()` fails with a negative `bodyLength` error.
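The write path is roughly the following (a minimal sketch: the real schema has more columns and the batches are far larger, and the file name is illustrative):

```csharp
using System.IO;
using System.Linq;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Memory;

// Minimal sketch of the write path; the real schema and data are much larger,
// but the shape of the calls is the same.
var allocator = new NativeMemoryAllocator(alignment: 64);

RecordBatch batch = new RecordBatch.Builder(allocator)
    .Append("value", false,
            col => col.Int32(array => array.AppendRange(Enumerable.Range(0, 1_000))))
    .Build();

using FileStream stream = File.Create("example.feather");   // illustrative path
using var writer = new ArrowFileWriter(stream, batch.Schema);
writer.WriteRecordBatch(batch);
writer.WriteEnd();
```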
This appears to be coming from `CheckMetadataAndGetBodyLength()`, which takes `bodyLength` as an `int64_t`. So presumably something upstream of `CheckMetadataAndGetBodyLength()` is handling `bodyLength` as 32 bit signed and inducing integer rollover.

The error message is a bit cryptic and, from a brief look at the code, it's unclear how a block's `bodyLength` relates to file and record batch size. Since forcing the total (uncompressed) size of a record batch below 2 GB avoids the negative `bodyLength` error, it appears `CheckMetadataAndGetBodyLength()` is picking up something controlled by record batch size rather than file size. At the moment I've tested up to 3.6 GB multibatch files and `read_feather()` handles them fine.

The Arrow spec seems to suggest compliant Arrow implementations support array lengths to 2³¹ - 1 elements. The record batch format here is 11 four byte columns, plus a one byte and a couple two bytes, so one reading of the spec is that files (or maybe record batches) up to 98 GB should be compatible with the recommended limits on multi-language use.
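Spelling out that estimate (my arithmetic, not a figure from the spec):

```csharp
// Back-of-the-envelope capacity for this schema at the 2^31 - 1 element limit.
long bytesPerRow = 11 * 4 + 1 + 2 * 2;   // 49 bytes per row
long maxRows = int.MaxValue;             // 2,147,483,647 rows per array
long maxBytes = bytesPerRow * maxRows;   // 105,226,698,703 bytes, roughly 98 GiB
```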
I'm including the C++ tag on this as I'm not seeing where C# or R would be inducing rollover, suggesting it may be a lower level issue or perhaps at the managed-unmanaged boundary.
`ArrowBuffer` is built on `ReadOnlyMemory<byte>` and therefore constrained by its int32 `Length` property. Not sure if that's considered to meet spec, but it doesn't appear it should be causing rollover here (the 2.02 GB file is only 44M records, so not more than 170 MB per column even when everything's in one `RecordBatch`), and the unchecked int32 conversion in `ArrowMemoryReaderImplementation.ReadNextRecordBatch()` probably isn't involved either. Since I'm not getting an exception at write, the checked C# conversions from `long` can be excluded.
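For context on that last point (schematic, not Apache.Arrow code): a checked narrowing from `long` throws, so it would have failed loudly at write time, whereas an unchecked narrowing wraps silently into exactly this kind of negative length:

```csharp
using System;

long bodyLength = 2_176_530_466;                // roughly the failing batch size (illustrative)

Console.WriteLine(unchecked((int)bodyLength));  // -2118436830: silent rollover

try
{
    _ = checked((int)bodyLength);               // throws rather than wrapping
}
catch (OverflowException)
{
    Console.WriteLine("checked conversion throws OverflowException");
}
```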
Component(s)

C#, C++, R