apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Exit spark-sql will cause core dump in libhdfs.so #8072

Open liujiayi771 opened 1 day ago

liujiayi771 commented 1 day ago

Backend

VL (Velox)

Bug description

After executing SQL, exiting the spark-sql command line with Ctrl+C or the quit command causes a core dump. https://github.com/apache/incubator-gluten/pull/6172

Spark version

Spark-3.4.x

Spark configurations

No response

System information

No response

Relevant logs

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000154777e0dcb6, pid=1258650, tid=0x00001547a437f640
#
# JRE version: OpenJDK Runtime Environment (8.0_432-b06) (build 1.8.0_432-b06)
# Java VM: OpenJDK 64-Bit Server VM (25.432-b06 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libhdfs.so+0x2cb6]  globalClassReference+0xb6
#
# Core dump written. Default location: /root/core or core.1258650
#
# An error report file with more information is saved as:
# /root/hs_err_pid1258650.log
#
# If you would like to submit a bug report, please visit:
#   https://access.redhat.com/support/cases/
#
#0  0x00001544daa78005 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x1544c157f640 (LWP 3935772))]
(gdb) bt
#0  0x00001544daa78005 in raise () from /lib64/libc.so.6
#1  0x00001544daa4a894 in abort () from /lib64/libc.so.6
#2  0x00001544d8c144d7 in os::abort(bool) [clone .cold] () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#3  0x00001544d95dceca in VMError::report_and_die() () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#4  0x00001544d93c839a in JVM_handle_linux_signal () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#5  0x00001544d93bb49c in signalHandler(int, siginfo_t*, void*) () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x0000154495624cb6 in globalClassReference (className=className@entry=0x15449562d5c8 "org/apache/hadoop/fs/FileSystem", env=env@entry=0x1544d87b72e8, out=out@entry=0x1544c157cbf8)
    at xxx/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c:299
#8  0x0000154495624ef4 in invokeMethod (env=0x1544d87b72e8, retval=0x0, methType=INSTANCE, instObj=0x1544d86c1bd0, className=0x15449562d5c8 "org/apache/hadoop/fs/FileSystem", methName=0x15449562d9da "close",
    methSignature=0x15449562d41a "()V") at xxx/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c:123
#9  0x0000154495627e80 in hdfsDisconnect (fs=0x1544d86c1bd0) at xxx/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/hdfs.c:880

liujiayi771 commented 1 day ago

cc @zhouyuan.

zhouyuan commented 1 day ago

@liujiayi771 thanks for reporting. It looks like unloading libhdfs.so is not working properly in your test environment. Would it be convenient to also paste the detailed log from /root/hs_err_pid1258650.log?

thanks, -yuan

liujiayi771 commented 1 day ago

@zhouyuan I have added the error stack in the description.

zhouyuan commented 18 hours ago

@liujiayi771 is the libhdfs.so from a vanilla HDFS project, or has it been customized? Based on the stack, it looks like HDFS is trying to invoke the close method https://github.com/apache/hadoop/blob/branch-3.0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c#L123 but cannot find the right symbol(?), and then calls the cleanup function https://github.com/apache/hadoop/blob/branch-3.0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c#L299

thanks, -yuan

liujiayi771 commented 14 hours ago

@zhouyuan We use a customized HDFS, but hdfs.c and jni_helper.c are the same as in branch-3.0 of the hadoop repo. We have never modified the libhdfs code. I will still test it with vanilla HDFS. Can you reproduce this issue?

liujiayi771 commented 14 hours ago

I checked the FileSystem code in our codebase, and the close() method, which is a basic interface, definitely hasn't been modified. It's strange that JNI couldn't find this method.

zhouyuan commented 13 hours ago

Hi @liujiayi771, I tried locally but was not able to trigger it. I think we may need to add more guards in the Velox filesystem close(). CC @JkSelf for her comments.

Thanks, -yuan

JkSelf commented 11 hours ago

@liujiayi771 Could you test running this command before starting your application? export CLASSPATH=$($HADOOP_HOME/bin/hdfs classpath --glob)
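
A minimal sketch of the CLASSPATH setup suggested above, assuming a POSIX shell. The helper name build_hdfs_classpath is hypothetical and is not part of Hadoop or Gluten; it only wraps the `hdfs classpath --glob` call with a check that the launcher exists, so that an unset or wrong HADOOP_HOME fails loudly instead of exporting an empty CLASSPATH.

```shell
# Hypothetical helper: print the expanded Hadoop classpath, or fail loudly.
# Optional argument: path to the hdfs launcher
# (defaults to $HADOOP_HOME/bin/hdfs).
build_hdfs_classpath() {
  hdfs_bin="${1:-${HADOOP_HOME:-}/bin/hdfs}"
  if [ ! -x "$hdfs_bin" ]; then
    echo "hdfs launcher not found or not executable: $hdfs_bin" >&2
    return 1
  fi
  # --glob expands wildcard entries into individual jar paths.
  "$hdfs_bin" classpath --glob
}
```

It could be used as `CLASSPATH="$(build_hdfs_classpath)" && export CLASSPATH` in the same shell session, before launching spark-sql.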

liujiayi771 commented 8 hours ago

@JkSelf I have tested it, and it still results in a core dump. I will investigate this issue further in the next few days.