apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Crashed on jni_SetByteArrayRegion when trying to serialize the broadcast data #7624

Open NEUpanning opened 3 weeks ago

NEUpanning commented 3 weeks ago

Backend

VL (Velox)

Bug description

Stack trace

(gdb) bt
#0  0x00007fdf34ec5387 in raise () from /lib64/libc.so.6
#1  0x00007fdf34ec6a78 in abort () from /lib64/libc.so.6
#2  0x00007fdf33b32f85 in os::abort(bool) () from /usr/local/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
#3  0x00007fdf33cd5383 in VMError::report_and_die() () from /usr/local/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
#4  0x00007fdf33b3848f in JVM_handle_linux_signal () from /usr/local/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
#5  0x00007fdf33b2e9d3 in signalHandler(int, siginfo*, void*) () from /usr/local/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007fdf338dea82 in jni_SetByteArrayRegion () from /usr/local/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
#8  0x00007fdec47a53a7 in JNIEnv_::SetByteArrayRegion (this=0x7fdec87f01f8, array=0x0, start=0, len=1117907072, buf=0x7fdd602005c0 "\361\311g")
    at /usr/lib/jvm/java-1.8.0-openjdk/include/jni.h:1769
#9  0x00007fdec47a33d1 in Java_org_apache_gluten_vectorized_ColumnarBatchSerializerJniWrapper_serialize (env=0x7fdec87f01f8, wrapper=0x7fde941fd028,
    handles=0x7fde941fd020) at /opt/meituan/panning/gluten-dev/gluten/dev/meituan/src/gluten/cpp/core/jni/JniWrapper.cc:1243
#10 0x00007fdf231a3a34 in ?? ()
#11 0x00007fdec87f0000 in ?? ()
#12 0x00007fdf231a3782 in ?? ()
#13 0x00007fde941fcfc0 in ?? ()
#14 0x00007fde91af7400 in ?? ()
#15 0x00007fde941fd028 in ?? ()
#16 0x00007fde91af7648 in ?? ()
#17 0x0000000000000000 in ?? ()

The crash happens because bufferArr is nullptr when it is passed to SetByteArrayRegion, as frame 9 shows:

(gdb) f 9
#9  0x00007fdec47a33d1 in Java_org_apache_gluten_vectorized_ColumnarBatchSerializerJniWrapper_serialize (env=0x7fdec87f01f8, wrapper=0x7fde941fd028,
    handles=0x7fde941fd020) at /opt/meituan/panning/gluten-dev/gluten/dev/meituan/src/gluten/cpp/core/jni/JniWrapper.cc:1243
warning: Source file is more recent than executable.
1243      env->SetByteArrayRegion(bufferArr, 0, buffer->size(), reinterpret_cast<const jbyte*>(buffer->data()));
(gdb) list
1238      }
1239
1240      auto serializer = ctx->createColumnarBatchSerializer(nullptr);
1241      auto buffer = serializer->serializeColumnarBatches(batches);
1242      auto bufferArr = env->NewByteArray(buffer->size());
1243      env->SetByteArrayRegion(bufferArr, 0, buffer->size(), reinterpret_cast<const jbyte*>(buffer->data()));
1244
1245      jobject columnarBatchSerializeResult =
1246          env->NewObject(columnarBatchSerializeResultClass, columnarBatchSerializeResultConstructor, numRows, bufferArr);
1247
(gdb) p bufferArr
$1 = (_jbyteArray *) 0x0

Spark version

Spark-3.0.x

Spark configurations

No response

System information

No response

Relevant logs

Related log in stdout:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
#   Executing /bin/sh -c "kill 33638"...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc6cd8dea82, pid=33638, tid=0x00007fc6862fe700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6d0a82]  jni_SetByteArrayRegion+0xc2
#
# Core dump written. Default location: /data2/hadoop/yarn/nm-local-dir/usercache/hadoop-data-governance/appcache/application_1715238577190_8373739/container_e74_1715238577190_8373739_01_000017/core or core.33638 (max size 4194304 kB). To ensure a full core dump, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data11/hadoop/yarn/userlogs/application_1715238577190_8373739/container_e74_1715238577190_8373739_01_000017/jvm_error_pid_33638.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp

Related log in jvm_error_pid_33638.log:

Heap:
 PSYoungGen      total 432640K, used 11475K [0x00000000e0000000, 0x0000000100000000, 0x0000000100000000)
  eden space 407552K, 2% used [0x00000000e0000000,0x00000000e0b34fb0,0x00000000f8e00000)
  from space 25088K, 0% used [0x00000000fa800000,0x00000000fa800000,0x00000000fc080000)
  to   space 26624K, 0% used [0x00000000f8e00000,0x00000000f8e00000,0x00000000fa800000)
 ParOldGen       total 1048576K, used 12476K [0x00000000a0000000, 0x00000000e0000000, 0x00000000e0000000)
  object space 1048576K, 1% used [0x00000000a0000000,0x00000000a0c2f028,0x00000000e0000000)
 Metaspace       used 52658K, capacity 55918K, committed 59008K, reserved 1099776K
  class space    used 6883K, capacity 7205K, committed 7808K, reserved 1048576K

Card table byte_map: [0x00007fc6bc8cb000,0x00007fc6bcbcc000] byte_map_base: 0x00007fc6bc3cb000

NEUpanning commented 3 weeks ago

JVM heap didn't have enough memory to allocate the array, so env->NewByteArray returned null. This caused the crash. Maybe we should add a check for bufferArr != nullptr and return a more readable error message.
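
For context, the trace shows a request of len=1117907072 bytes (about 1.04 GiB), which is larger than the 1048576K ParOldGen reported in the error log. A minimal sketch of the proposed check follows. The helper name `newByteArrayChecked` is hypothetical and not part of Gluten. `Env` is a template parameter only so the sketch compiles without `<jni.h>`; in JniWrapper.cc it would be `JNIEnv` and `Byte` would be `jbyte`. The sketch assumes the caller translates C++ exceptions into Java exceptions at the JNI boundary. A real JNI version may also need to handle an OutOfMemoryError that the JVM already has pending.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not Gluten code). Env must provide JNIEnv-like
// NewByteArray(len) and SetByteArrayRegion(array, start, len, buf).
template <typename Env, typename Byte>
auto newByteArrayChecked(Env* env, const Byte* data, int64_t size) {
  assert(size == 0 || data != nullptr);

  // A Java array length is a 32-bit signed int (jsize), so reject anything larger.
  if (size < 0 || size > std::numeric_limits<int32_t>::max()) {
    throw std::length_error(
        "Serialized buffer of " + std::to_string(size) + " bytes does not fit in a Java byte[]");
  }

  const auto len = static_cast<int32_t>(size);
  auto arr = env->NewByteArray(len);
  if (arr == nullptr) {
    // NewByteArray returns null when the array cannot be constructed,
    // for example when the JVM heap is too small for it.
    throw std::runtime_error(
        "Failed to allocate a Java byte[] of " + std::to_string(size) +
        " bytes; the JVM heap may be too small for the serialized data");
  }

  // Only copy once the array is known to be valid.
  env->SetByteArrayRegion(arr, 0, len, data);
  return arr;
}
```

With a helper like this, lines 1242-1243 of the listing could collapse into a single call such as `newByteArrayChecked(env, reinterpret_cast<const jbyte*>(buffer->data()), buffer->size())`, and a failed allocation would produce a readable error instead of a SIGSEGV.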

NEUpanning commented 2 weeks ago

In our case, we allocate more off-heap memory and less on-heap memory to Gluten compared to vanilla Spark. As a result, vanilla Spark can succeed with enough on-heap memory to execute broadcasts, but Gluten fails.

zhztheplayer commented 2 weeks ago

> In our case, we allocate more off-heap memory and less on-heap memory to Gluten compared to vanilla Spark. As a result, vanilla Spark can succeed with enough on-heap memory to execute broadcasts, but Gluten fails.

Thanks for sharing.

> This caused the crash. Maybe we should add a check for bufferArr != nullptr and return a more readable error message.

Agreed. We should avoid the crash in any case.