apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

Run `jstack` against Spark Driver process failed on MacOS M1 #5702

Open zhouyifan279 opened 6 months ago

zhouyifan279 commented 6 months ago

Backend

VL (Velox)

Bug description

Launch spark-sql in local mode and run jstack against it:

export gluten_jar=/Users/zhouyifan/git/incubator-gluten/package/target/gluten-velox-bundle-spark3.5_2.12-osx_14.4_aarch_64-1.2.0-SNAPSHOT.jar

./bin/spark-sql \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.driver.extraClassPath=${gluten_jar} \
  --conf spark.executor.extraClassPath=${gluten_jar} \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager

jstack exits with the following error message:

74551: Unable to open socket file /var/folders/yj/25xqj6_52n51xmctftgl_77c0000gn/T/.attach_pid74551: target process 74551 doesn't respond within 10500ms or HotSpot VM not loaded

Spark version

spark-3.5.1-bin-hadoop3

Spark configurations

No response

System information

JDK

openjdk version "1.8.0_402"
OpenJDK Runtime Environment (Zulu 8.76.0.17-CA-macos-aarch64) (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (Zulu 8.76.0.17-CA-macos-aarch64) (build 25.402-b06, mixed mode)

System

Velox System Info v0.0.2
Commit: 82e50ab196caff398013a3e76ca3b854a1156243
CMake Version: 3.29.3
System: Darwin-23.4.0
Arch: arm64
C++ Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
C++ Compiler Version: 15.0.0.15000309
C Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
C Compiler Version: 15.0.0.15000309
CMake Prefix Path: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr;/opt/homebrew;/usr/local;/usr;/;/opt/homebrew/Cellar/cmake/3.29.3;/usr/local;/usr/X11R6;/usr/pkg;/opt;/sw;/opt/local

Relevant logs

No response

zhouyifan279 commented 6 months ago

Verified that:

  1. Oracle JDK 8 aarch64 (MacOS) has this bug.
  2. OpenJDK 17 aarch64 (MacOS) has this bug.
  3. OpenJDK 8 amd64 (Ubuntu) does not have this bug.

zhouyifan279 commented 6 months ago

Adding JVM option -XX:+StartAttachListener can make jstack work:

./bin/spark-sql \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.driver.extraClassPath=${gluten_jar} \
  --conf spark.executor.extraClassPath=${gluten_jar} \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.driver.extraJavaOptions=-XX:+StartAttachListener
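
To confirm that the option actually reached the driver JVM, its command line can be inspected. The following is a minimal sketch, not part of the original report: `has_attach_listener_flag` and the `DRIVER_PID` variable are hypothetical names, and it assumes `ps` accepts `-ww -p <pid> -o command=` (as it does on macOS and Linux).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given command line contains
# -XX:+StartAttachListener as a separate argument.
has_attach_listener_flag() {
  case " $1 " in
    *" -XX:+StartAttachListener "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only runs when DRIVER_PID is set (assumption: the pid of the spark-sql
# driver, for example as reported by `jps -l`).
if [ -n "${DRIVER_PID:-}" ]; then
  if has_attach_listener_flag "$(ps -ww -p "$DRIVER_PID" -o command=)"; then
    echo "StartAttachListener is set for pid $DRIVER_PID"
  else
    echo "StartAttachListener is NOT set for pid $DRIVER_PID"
  fi
fi
```

Run it as `DRIVER_PID=<pid> sh check_flag.sh`, where the script name is arbitrary.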

zhouyifan279 commented 6 months ago

According to this doc, jstack communicates with the JVM through a local socket file under the JVM's temporary directory, named .java_pid<pid>. The JVM creates the .java_pid<pid> file when it receives SIGQUIT.

I ran the following test cases and observed different behavior for the .java_pid<pid> file:

  1. If the JVM option -XX:+StartAttachListener is specified, the .java_pid<pid> file is present as soon as the JVM starts.
  2. If -XX:+StartAttachListener is not specified and --conf spark.plugins=org.apache.gluten.GlutenPlugin is removed, the .java_pid<pid> file is present after executing jstack.
  3. If -XX:+StartAttachListener is not specified and --conf spark.plugins=org.apache.gluten.GlutenPlugin is present, the .java_pid<pid> file is not present even after executing jstack.
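
The file checks in the cases above can be sketched as a small script. This is an illustration under assumptions, not the exact procedure used: `attach_paths` and `check_attach_files` are hypothetical helper names, the temporary directory is assumed to be `$TMPDIR` (falling back to /tmp), and the `.attach_pid<pid>` name is taken from the error message earlier in this issue.

```shell
#!/bin/sh
# Hypothetical helper: print the attach-related paths for a pid.
# $1 = pid, $2 = temp dir (assumed default: $TMPDIR, else /tmp).
attach_paths() {
  pid="$1"
  tmp="${2:-${TMPDIR:-/tmp}}"
  tmp="${tmp%/}"   # drop a trailing slash, which macOS $TMPDIR usually has
  printf '%s\n' "$tmp/.java_pid$pid" "$tmp/.attach_pid$pid"
}

# Report which of those files currently exist.
check_attach_files() {
  attach_paths "$@" | while IFS= read -r f; do
    if [ -e "$f" ]; then
      echo "present: $f"
    else
      echo "absent:  $f"
    fi
  done
}

# Demo call: uses DRIVER_PID if set, otherwise this shell's own pid.
check_attach_files "${DRIVER_PID:-$$}"
```

Running it before and after a jstack attempt shows whether the socket file was created in each of the three cases.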

zhouyifan279 commented 6 months ago

A simplified program can reproduce this bug.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class AttachListener {

  public static void main(String[] args) throws InterruptedException {
    // Loading libvelox.dylib is the only thing this program does before sleeping.
    File file = extractVeloxLibrary();
    System.load(file.getAbsolutePath());
    System.out.println("Library velox loaded");
    // Keep the JVM alive so that jstack can be run against it.
    Thread.sleep(Long.MAX_VALUE);
  }

  // Copies /libvelox.dylib from the classpath into java.io.tmpdir.
  static File extractVeloxLibrary() {
    String tmpdir = System.getProperty("java.io.tmpdir");
    File file = new File(tmpdir, "libvelox.dylib");
    if (file.exists()) {
      file.delete();
    }
    try (InputStream is = AttachListener.class.getResourceAsStream("/libvelox.dylib")) {
      // getResourceAsStream returns null when the resource is missing.
      if (is == null) {
        throw new IOException("libvelox.dylib not found on the classpath");
      }
      try (FileOutputStream fos = new FileOutputStream(file)) {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = is.read(buffer)) != -1) {
          fos.write(buffer, 0, read);
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to extract library", e);
    }
    return file;
  }
}

Compile and run:

javac AttachListener.java
java -cp /path/to/incubator-gluten/package/target/gluten-velox-bundle-spark3.5_2.12-osx_14.4_aarch_64-1.2.0-SNAPSHOT.jar:/path/to/spark-3.5.1-bin-hadoop3/jars/*:. AttachListener

jstack also fails against the AttachListener process.

zhouyifan279 commented 6 months ago

I suspect that loading libvelox.dylib interferes with the JVM's internal attach mechanism. However, I'm not a JVM expert and know little about libvelox.dylib, so I can't dig deeper to find the root cause.

The OpenJDK project has a similar issue, https://bugs.openjdk.org/browse/JDK-8235211, but it does not seem relevant.

xumingming commented 6 months ago

I am using macOS (Apple Silicon) with this JDK:

openjdk version "1.8.0_402"
OpenJDK Runtime Environment (Zulu 8.76.0.17-CA-macos-aarch64) (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (Zulu 8.76.0.17-CA-macos-aarch64) (build 25.402-b06, mixed mode)

and jstack works