apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Bug(arrow-row): calling the `convert_raw` function causes an "offset overflow" panic #6112

Open JasonLi-cn opened 4 months ago

JasonLi-cn commented 4 months ago

Describe the bug

select http_url from tab group by http_url

In the extreme case, appending two large Utf8 values (len1 + len2 > i32::MAX) into Rows and then calling convert_rows should also trigger the bug. 🤔

To Reproduce
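A minimal sketch of the extreme case described above (assuming the standard arrow-row conversion API; it allocates several GiB, since each string is just over 1 GiB):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, StringArray};
use arrow_row::{RowConverter, SortField};
use arrow_schema::DataType;

fn main() {
    // Two strings whose combined length exceeds i32::MAX (~2 GiB)
    let big = "x".repeat(i32::MAX as usize / 2 + 1);
    let col: ArrayRef = Arc::new(StringArray::from(vec![big.as_str(), big.as_str()]));

    let converter = RowConverter::new(vec![SortField::new(DataType::Utf8)]).unwrap();
    let rows = converter.convert_columns(&[col]).unwrap();

    // Rebuilding a Utf8 array requires i32 offsets, which overflow here,
    // so this panics with "offset overflow"
    let _ = converter.convert_rows(&rows).unwrap();
}
```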

Expected behavior

Additional context

alamb commented 4 months ago

In newer versions of DataFusion I would expect that grouping on a string column would not use the row format, but instead use the special GroupValuesBytes:

https://github.com/apache/datafusion/blob/7db4213b71ed9e914c5a4f16954abfa20b091ae3/datafusion/physical-plan/src/aggregates/group_values/bytes.rs#L27

alamb commented 4 months ago

The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔

tustvold commented 4 months ago

FWIW, in general offset overflows do yield panics in arrow-rs; the additional plumbing to handle what is almost always an unrecoverable error has been hard to justify. That said, I suspect in this case it could be made into an ArrowError without it being a breaking change.

Edit: I've updated this to be an enhancement; panics are not a bug.
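One illustrative way such an overflow could surface as an ArrowError rather than a panic (a sketch, not arrow-rs's actual decode code; `checked_offset` is a hypothetical helper):

```rust
use arrow_schema::ArrowError;

// Hypothetical helper: convert the running byte offset into an i32 array
// offset, returning an ArrowError instead of panicking on overflow.
fn checked_offset(next_offset: usize) -> Result<i32, ArrowError> {
    i32::try_from(next_offset).map_err(|_| {
        ArrowError::ComputeError(format!(
            "offset overflow: {next_offset} exceeds i32::MAX"
        ))
    })
}
```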

JasonLi-cn commented 4 months ago

> The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔

Yes, this problem was first discovered in a query that grouped by multiple columns.

JasonLi-cn commented 4 months ago

> FWIW, in general offset overflows do yield panics in arrow-rs; the additional plumbing to handle what is almost always an unrecoverable error has been hard to justify. That said, I suspect in this case it could be made into an ArrowError without it being a breaking change.
>
> Edit: I've updated this to be an enhancement; panics are not a bug.

Do you mean we need to add a new function like the following?

```rust
pub fn try_decode_binary<I: OffsetSizeTrait>(
    rows: &mut [&[u8]],
    options: SortOptions,
) -> Result<GenericBinaryArray<I>, ArrowError> {
    ...
}
```
alamb commented 4 months ago

> Do you mean we need to add a new function like the following?

Maybe you could add a check in your code (or in DataFusion 🤔) on the size of the string buffer and make a new record batch if it exceeds 2GB or something. This might be related: https://github.com/apache/datafusion/issues/9562

Making a single array with more than 2GB of string data is likely to be non-ideal in a bunch of ways.
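A caller-side guard along those lines might look like this (a sketch; `near_offset_limit` and the `headroom` parameter are made up for illustration):

```rust
use arrow_array::StringArray;

/// Hypothetical guard: true when a Utf8 column's value buffer, plus the
/// bytes about to be appended, would cross the i32 offset limit, telling
/// the caller to flush the current batch and start a new one.
fn near_offset_limit(col: &StringArray, headroom: usize) -> bool {
    col.value_data().len() + headroom > i32::MAX as usize
}
```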