apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
830 stars 164 forks source link

range end index 294912 out of range for slice of length 147456 #540

Closed andygrove closed 5 months ago

andygrove commented 5 months ago

Describe the bug

I just saw this when running benchmarks with latest from main and with xxhash64 disabled.

Caused by: org.apache.comet.CometNativeException: General execution error with reason org.apache.comet.CometNativeException: range end index 294912 out of range for slice of length 147456
        at comet::errors::init::{{closure}}(__internal__:0)
        at std::panicking::rust_panic_with_hook(__internal__:0)
        at std::panicking::begin_panic_handler::{{closure}}(__internal__:0)
        at std::sys_common::backtrace::__rust_end_short_backtrace(__internal__:0)
        at rust_begin_unwind(__internal__:0)
        at core::panicking::panic_fmt(__internal__:0)
        at core::slice::index::slice_end_index_len_fail(__internal__:0)
        at arrow_data::transform::variable_size::build_extend::{{closure}}(__internal__:0)
        at arrow_data::transform::MutableArrayData::extend(__internal__:0)
        at comet::execution::operators::copy_array(__internal__:0)
        at comet::execution::operators::copy_or_cast_array(__internal__:0)
        at <comet::execution::operators::copy::CopyStream as futures_core::stream::Stream>::poll_next(__internal__:0)
        at <datafusion_physical_plan::stream::RecordBatchStreamAdapter<S> as futures_core::stream::Stream>::poll_next(__internal__:0)
        at Java_org_apache_comet_Native_executePlan(__internal__:0)
        at <unknown>(__internal__:0).
    at org.apache.comet.Native.executePlan(Native Method)
    at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:102)
    at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:125)
    at org.apache.spark.sql.comet.execution.shuffle.CometShuffleWriteProcessor.write(CometShuffleExchangeExec.scala:492)

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

viirya commented 5 months ago

@andygrove Is there any reproducible example?

viirya commented 5 months ago

Btw, it doesn't fail in shuffle writer.

From the stack trace, it looks like when the writer to pull next batch from its upstream, the error happened.

andygrove commented 5 months ago

@andygrove Is there any reproducible example?

I was running TPC-H @ 100gb

andygrove commented 5 months ago

It failed on q10. I am adding debug logging to see if I can track this down.

andygrove commented 5 months ago

The error is happening in unsafe code in arrow-rs.

Here is some debug output showing the calls leading up the the error:

copy_or_cast_array() len=8192, type=Utf8
copy_array(typeUtf8, len=8192) before mutable.extend(0, 0, 8192) data len = 8192
getNextBatch id=2284, plan=135761185342032 failed during native execution: range end index 294912 out of range for slice of length 147456

Note that there are many earlier calls that look identical and do not fail.

The error happens in arrow_data::transform::variable_size::build_extend:

        at core::panicking::panic_fmt(__internal__:0)
        at core::slice::index::slice_end_index_len_fail(__internal__:0)
        at arrow_data::transform::variable_size::build_extend::{{closure}}(__internal__:0)
        at arrow_data::transform::MutableArrayData::extend(__internal__:0)
        at comet::execution::operators::copy_array(__internal__:0)
        at comet::execution::operators::copy_or_cast_array(__internal__:0)

This function calls get_last_offset to get the last offset and there are some documented assumptions. I wonder if we are violating any of these?

pub(super) unsafe fn get_last_offset<T: ArrowNativeType>(offset_buffer: &MutableBuffer) -> T {
    // JUSTIFICATION
    //  Benefit
    //      20% performance improvement extend of variable sized arrays (see bench `mutable_array`)
    //  Soundness
    //      * offset buffer is always extended in slices of T and aligned accordingly.
    //      * Buffer[0] is initialized with one element, 0, and thus `mutable_offsets.len() - 1` is always valid.
    let (prefix, offsets, suffix) = offset_buffer.as_slice().align_to::<T>();
    debug_assert!(prefix.is_empty() && suffix.is_empty());
    *offsets.get_unchecked(offsets.len() - 1)
}
andygrove commented 5 months ago

I ran with a debug build and got a more detailed stack trace:

        at core::panicking::panic_fmt(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72)
        at core::slice::index::slice_end_index_len_fail_rt(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:76)
        at core::slice::index::slice_end_index_len_fail(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:68)
        at <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:394)
        at core::slice::index::<impl core::ops::index::Index<I> for [T]>::index(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:18)
        at arrow_data::transform::variable_size::extend_offset_values(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/variable_size.rs:38)
        at arrow_data::transform::variable_size::build_extend::{{closure}}(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/variable_size.rs:57)
        at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2029)
        at arrow_data::transform::MutableArrayData::extend(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/mod.rs:649)
        at comet::execution::operators::copy_array(/home/andy/git/apache/datafusion-comet/core/src/execution/operators/mod.rs:70)

The error happens in extend_offset_values on this line:

let new_values = &values[start_values..end_values];
andygrove commented 5 months ago

Here is some more debug info, showing the size of the buffers in the array data and the value in the first element of each buffer:

copy_array(typeInt64, len=8192) after mutable.extend(0, 0, 8192) data len = 8192
copy_or_cast_array() len=8192, type=Utf8
copy_array num buffers=2
BUF[0] len=32772 cap=32772
buf[0][0] = 0
BUF[1] len=147456 cap=147456
buf[1][0] = 67
getNextBatch id=2284, plan=126232496596768 failed during native execution: range end index 294912 out of range for slice of length 147456
viirya commented 5 months ago

Ok. I tried TPC-H @ 100gb and TPC-H @ 1gb. Only TPC-H @ 100gb can reproduce it. I will look into this.

viirya commented 5 months ago

I got some hint (I know which operator causes it) after debugging it for a while, though I don't get the root cause yet.

viirya commented 5 months ago

I've just figured out where the root cause is. I will go to propose a fix to arrow-rs.

EDIT: actually the issue is happened in Java Arrow instead of arrow-rs.

viirya commented 5 months ago

I opened an issue at Java Arrow repo and described the root cause: https://github.com/apache/arrow-java/issues/74. Fixing it there might wait for a longer release period. I'm thinking to copy BufferImportTypeVisitor into Comet and fix in Comet.