Open JasonLi-cn opened 4 months ago
In newer versions of DataFusion I would expect that grouping on a string column would not use the row format, but would instead use the special GroupValuesBytes.
The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔
FWIW in general offset overflows do yield panics in arrow-rs. The additional plumbing for error handling of what is almost always an unrecoverable error has been hard to justify, although I suspect in this case it could be made into an ArrowError without it being a breaking change.
Edit: I've updated this to be an enhancement, panics are not a bug
> The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔
Yes, this problem was first discovered in a multi-column GROUP BY.
> FWIW in general offset overflows do yield panics in arrow-rs, the additional plumbing for error handling what is almost always an unrecoverable error has been hard to justify, although I suspect in this case it could be made into an ArrowError without it being a breaking change.
> Edit: I've updated this to be an enhancement, panics are not a bug
Do you mean we need to add a new function like the following?
pub fn try_decode_binary<I: OffsetSizeTrait>(
    rows: &mut [&[u8]],
    options: SortOptions,
) -> Result<GenericBinaryArray<I>, ArrowError> {
    ...
}
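To make the idea concrete, here is a minimal sketch of the core of such a fallible variant, written in plain std Rust without any arrow-rs types. The name `try_build_offsets` and the `String` error type are hypothetical stand-ins (a real version would presumably return `ArrowError`); the only point is that the running `i32` offset is accumulated with checked arithmetic so an overflow becomes an `Err` rather than a panic.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not an arrow-rs API): turn per-value byte lengths
/// into i32 offsets, reporting overflow as an error instead of panicking.
pub fn try_build_offsets(lengths: &[usize]) -> Result<Vec<i32>, String> {
    let mut offsets = Vec::with_capacity(lengths.len() + 1);
    let mut end: i32 = 0;
    offsets.push(end);
    for (i, &len) in lengths.iter().enumerate() {
        // A single value may already be too large for an i32 offset.
        let len = i32::try_from(len)
            .map_err(|_| format!("value {i} has length {len}, which does not fit in i32"))?;
        // The running total is what overflows when many large values are combined.
        end = end
            .checked_add(len)
            .ok_or_else(|| format!("offset overflow at value {i}: total exceeds i32::MAX"))?;
        offsets.push(end);
    }
    Ok(offsets)
}
```

With this shape, two values of 1.5 billion bytes each produce an `Err` on the second value, while small inputs yield the usual offsets vector starting at 0.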
> Do you mean we need to add a new function like the following?
Maybe you could add a check in your code (or in DataFusion 🤔) on the size of the string buffer and start a new record batch if it exceeds 2GB or so. This might be related: https://github.com/apache/datafusion/issues/9562
Making a single array with more than 2GB of string data is likely to be non-ideal in a bunch of ways.
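The size check suggested above could be sketched roughly as follows, again in plain std Rust. `split_by_byte_budget` is a hypothetical helper, not something that exists in DataFusion or arrow-rs: given the byte length of each string, it returns consecutive row ranges whose total size stays within a caller-chosen budget, so each range could become its own record batch. For Utf8 arrays with i32 offsets the budget would have to be at most `i32::MAX` bytes.

```rust
use std::ops::Range;

/// Hypothetical helper: group consecutive values into row ranges whose
/// total byte size stays within `max_bytes`. A single value larger than
/// the budget is placed in a range of its own.
pub fn split_by_byte_budget(lengths: &[usize], max_bytes: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0; // first row of the range being built
    let mut used = 0usize; // bytes accumulated in the current range
    for (i, &len) in lengths.iter().enumerate() {
        // Close the current (non-empty) range if this value would exceed the budget.
        if i > start && used.saturating_add(len) > max_bytes {
            ranges.push(start..i);
            start = i;
            used = 0;
        }
        used = used.saturating_add(len);
    }
    // Flush the trailing range, if any rows remain.
    if start < lengths.len() {
        ranges.push(start..lengths.len());
    }
    ranges
}
```

An oversized single value still ends up alone in its range, so a caller would need to handle that case separately (for example by switching to LargeUtf8 with 64-bit offsets).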
Describe the bug
DataFusion Table Info: the table has an http_url column whose DataType is Utf8, and it has a lot of distinct values.
DataFusion SQL
Panic Info
Reason
GroupValuesRows in DataFusion stores rows using Rows, and this bug may be triggered when calling the emit function. https://github.com/apache/arrow-rs/blob/49e714de6e951169d0d5e73381af247ad0230fcf/arrow-row/src/variable.rs#L217-L226
In the extreme case, if I append two large Utf8 values (len1 + len2 > i32::MAX) to Rows in two appends and then call convert_rows, that should also trigger the bug. 🤔
To Reproduce
Expected behavior
Additional context