Closed: andygrove closed this issue 5 months ago.
@andygrove Is there any reproducible example?
Btw, it doesn't fail in the shuffle writer. From the stack trace, it looks like the error happened when the writer tried to pull the next batch from its upstream.
> @andygrove Is there any reproducible example?
I was running TPC-H @ 100gb and it failed on q10. I am adding debug logging to see if I can track this down.
The error is happening in unsafe code in arrow-rs.
Here is some debug output showing the calls leading up to the error:
```
copy_or_cast_array() len=8192, type=Utf8
copy_array(typeUtf8, len=8192) before mutable.extend(0, 0, 8192) data len = 8192
getNextBatch id=2284, plan=135761185342032 failed during native execution: range end index 294912 out of range for slice of length 147456
```
Note that there are many earlier calls that look identical and do not fail.
The error happens in `arrow_data::transform::variable_size::build_extend`:

```
at core::panicking::panic_fmt(__internal__:0)
at core::slice::index::slice_end_index_len_fail(__internal__:0)
at arrow_data::transform::variable_size::build_extend::{{closure}}(__internal__:0)
at arrow_data::transform::MutableArrayData::extend(__internal__:0)
at comet::execution::operators::copy_array(__internal__:0)
at comet::execution::operators::copy_or_cast_array(__internal__:0)
```
This function calls `get_last_offset` to get the last offset, and there are some documented assumptions. I wonder if we are violating any of these?
```rust
pub(super) unsafe fn get_last_offset<T: ArrowNativeType>(offset_buffer: &MutableBuffer) -> T {
    // JUSTIFICATION
    //  Benefit
    //      20% performance improvement extend of variable sized arrays (see bench `mutable_array`)
    //  Soundness
    //      * offset buffer is always extended in slices of T and aligned accordingly.
    //      * Buffer[0] is initialized with one element, 0, and thus `mutable_offsets.len() - 1` is always valid.
    let (prefix, offsets, suffix) = offset_buffer.as_slice().align_to::<T>();
    debug_assert!(prefix.is_empty() && suffix.is_empty());
    *offsets.get_unchecked(offsets.len() - 1)
}
```
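For reference, the `copy_array` / `copy_or_cast_array` frames in the trace sit directly on top of `MutableArrayData::extend`. Below is a simplified, hypothetical sketch of that call pattern (not Comet's actual `copy_array`), just to show where the failing `extend` call fits:

```rust
use std::sync::Arc;

use arrow::array::{make_array, Array, ArrayRef, MutableArrayData, StringArray};

// Hypothetical, simplified sketch (not Comet's actual copy_array): deep-copy a
// column through MutableArrayData::extend, the call that panics in the trace above.
fn copy_array(array: &ArrayRef) -> ArrayRef {
    let data = array.to_data();
    let mut mutable = MutableArrayData::new(vec![&data], false, array.len());
    // For Utf8 arrays, extend() reads the source offsets and slices the values
    // buffer accordingly, which is where the out-of-range panic is raised.
    mutable.extend(0, 0, array.len());
    make_array(mutable.freeze())
}

fn main() {
    let input: ArrayRef = Arc::new(StringArray::from(vec!["a", "bc", "def"]));
    let copied = copy_array(&input);
    assert_eq!(copied.len(), input.len());
}
```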
I ran with a debug build and got a more detailed stack trace:
```
at core::panicking::panic_fmt(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72)
at core::slice::index::slice_end_index_len_fail_rt(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:76)
at core::slice::index::slice_end_index_len_fail(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:68)
at <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:394)
at core::slice::index::<impl core::ops::index::Index<I> for [T]>::index(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:18)
at arrow_data::transform::variable_size::extend_offset_values(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/variable_size.rs:38)
at arrow_data::transform::variable_size::build_extend::{{closure}}(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/variable_size.rs:57)
at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2029)
at arrow_data::transform::MutableArrayData::extend(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/mod.rs:649)
at comet::execution::operators::copy_array(/home/andy/git/apache/datafusion-comet/core/src/execution/operators/mod.rs:70)
```
The error happens in `extend_offset_values` on this line:

```rust
let new_values = &values[start_values..end_values];
```
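In other words, the panic is just Rust's bounds check on that slice: the end offset read from the offsets buffer points past the end of the values buffer that was actually provided. A minimal, dependency-free sketch of the same failure mode, using the numbers from the log above:

```rust
// Dependency-free illustration of the failure mode: the final offset claims the values
// end at byte 294912, but the values buffer that was actually handed over is only
// 147456 bytes, so the slice panics with
// "range end index 294912 out of range for slice of length 147456".
fn main() {
    let values = vec![0u8; 147_456];          // values buffer as received
    let offsets: Vec<i32> = vec![0, 294_912]; // offsets claim twice as many value bytes
    let start = offsets[0] as usize;
    let end = *offsets.last().unwrap() as usize;
    let _new_values = &values[start..end];    // panics with the message seen above
}
```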
Here is some more debug info, showing the size of the buffers in the array data and the value in the first element of each buffer:
```
copy_array(typeInt64, len=8192) after mutable.extend(0, 0, 8192) data len = 8192
copy_or_cast_array() len=8192, type=Utf8
copy_array num buffers=2
BUF[0] len=32772 cap=32772
buf[0][0] = 0
BUF[1] len=147456 cap=147456
buf[1][0] = 67
getNextBatch id=2284, plan=126232496596768 failed during native execution: range end index 294912 out of range for slice of length 147456
```
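The logged sizes look internally consistent for a Utf8 array of 8192 rows except for the values buffer: 8193 offsets take 32772 bytes (matching BUF[0]), while the end index in the panic (294912) is exactly twice the 147456-byte values buffer in BUF[1]. A quick check of that arithmetic (assuming i32 offsets, 8192 rows, per the log):

```rust
// Quick arithmetic on the logged numbers (assuming 8192 rows and i32 offsets).
fn main() {
    let rows: usize = 8192;
    let offset_buf_len = (rows + 1) * std::mem::size_of::<i32>();
    assert_eq!(offset_buf_len, 32_772); // matches BUF[0] len

    let values_buf_len: usize = 147_456; // BUF[1] len
    let end_index: usize = 294_912;      // end index from the panic message
    assert_eq!(end_index, 2 * values_buf_len); // offsets point past the imported values

    println!(
        "offsets buffer = {offset_buf_len} bytes, panic end index = {end_index}, values buffer = {values_buf_len} bytes"
    );
}
```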
Ok. I tried TPC-H @ 100gb and TPC-H @ 1gb. Only TPC-H @ 100gb reproduces it. I will look into this.
I got some hints (I know which operator causes it) after debugging it for a while, though I don't have the root cause yet.
I've just figured out where the root cause is. I will propose a fix to arrow-rs.
EDIT: actually the issue happens in Java Arrow instead of arrow-rs.
I opened an issue in the Java Arrow repo and described the root cause: https://github.com/apache/arrow-java/issues/74. Fixing it there might have to wait for a longer release cycle, so I'm thinking of copying BufferImportTypeVisitor into Comet and fixing it there.
### Describe the bug

I just saw this when running benchmarks with the latest from main and with xxhash64 disabled.

### Steps to reproduce

No response

### Expected behavior

No response

### Additional context

No response