Closed niloc132 closed 1 week ago
It looks like invoking the DH ShutdownManager in the atexit hook largely resolves this: across 2 hrs of consecutive 3-4s runs, the process no longer exits with a code other than zero. It does, however, very intermittently encounter a SIGSEGV inside of Java. The crash log is only partially written to disk before the python process ends, so the bad exit code never happens (and the rest of the dump isn't completed either).
This new error seems to be that most of Python is shut down but the cleanup thread is still running. It would be nice if we could trigger this thread shutdown directly, either from python or from our own Java shutdown code.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000647a5cbc7480, pid=122266, tid=122351
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.4+7 (21.0.4+7) (build 21.0.4+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (21.0.4+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [python+0x230480]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /project/core.122266)
#
# If you would like to submit a bug report, please visit:
# https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--------------- S U M M A R Y ------------
Command Line: -Djpy.jpyLib=/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so -Djpy.jdlLib=/opt/deephaven/venv/lib/python3.10/site-packages/jdl.cpython-310-x86_64-linux-gnu.so -Djpy.pythonLib=/usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so -Djpy.pythonPrefix=/opt/deephaven/venv -Djpy.pythonExecutable=/opt/deephaven/venv/bin/python -DPythonDeephavenSession.initScripts= -DLoggerFactory.silenceOnProcessEnvironment=true -Dstdout.toLogBuffer=false -Dstderr.toLogBuffer=false -Dlogback.configurationFile=logback-minimal.xml -Xrs -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=/opt/deephaven/venv/lib/python3.10/site-packages/deephaven_server/jars/dh-compiler-directives.txt -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+UseStringDeduplication -XX:GCLockerRetryAllocationCount=128 --add-opens=java.base/java.nio=ALL-UNNAMED --add-exports=java.management/sun.management=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED
Host: AMD Ryzen Threadripper 1950X 16-Core Processor, 32 cores, 31G, Ubuntu 22.04.4 LTS
Time: Fri Aug 30 15:34:45 2024 UTC elapsed time: 3.374934 seconds (0d 0h 0m 3s)
--------------- T H R E A D ---------------
Current thread (0x0000647a68ea7020): JavaThread "PyObject-cleanup" daemon [_thread_in_native, id=122351, stack(0x000071ea4c5c7000,0x000071ea4c6c7000) (1024K)]
Stack: [0x000071ea4c5c7000,0x000071ea4c6c7000], sp=0x000071ea4c6c5580, free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [python+0x230480]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.jpy.PyLib.callAndReturnValue(JZLjava/lang/String;I[Ljava/lang/Object;[Ljava/lang/Class;Ljava/lang/Class;)Ljava/lang/Object;+0
j org.jpy.PyProxyHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+291
j org.jpy.$Proxy12.cleanupOnlyUseFromGIL()I+9
j org.jpy.PyObjectReferences.cleanupThreadLogic()V+15
j org.jpy.PyObjectReferences$$Lambda+0x000071ea043b8c80.run()V+4
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21.0.4
j java.lang.Thread.run()V+19 java.base@21.0.4
v ~StubRoutines::call_stub 0x000071ea74617cc6
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000010
Registers:
RAX=0x0000000000000000, RBX=0x0000000000000001, RCX=0x0000000000000001, RDX=0x000071ea8616f5f0
RSP=0x000071ea4c6c5580, RBP=0x0000647a5cf38ae0, RSI=0x0000000000000001, RDI=0x0000000000000000
R8 =0x0000000000000000, R9 =0x0000000000000000, R10=0x000071ea74623991, R11=0x000000060c9e75d8
R12=0x0000647a68ea73d8, R13=0x0000000000000000, R14=0x000071ea8616f5f0, R15=0x000071ea4c6c56d8
RIP=0x0000647a5cbc7480, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
TRAPNO=0x000000000000000e
Register to memory mapping:
RAX=0x0 is null
RBX=0x0000000000000001 is an unknown value
RCX=0x0000000000000001 is an unknown value
RDX=0x000071ea8616f5f0 points into unknown readable memory: 0x0000000000000001 | 01 00 00 00 00 00 00 00
RSP=0x000071ea4c6c5580 is pointing into the stack for thread: 0x0000647a68ea7020
RBP=0x0000647a5cf38ae0: _PyRuntime+0x0000000000000000 in python at 0x0000647a5c997000
RSI=0x0000000000000001 is an unknown value
RDI=0x0 is null
R8 =0x0 is null
R9 =0x0 is null
R10=0x000071ea74623991 is at code_begin+1009 in an Interpreter codelet
native method entry point (kind = native) [0x000071ea746235a0, 0x000071ea74624010] 2672 bytes
R11=0x000000060c9e75d8 is an oop: java.lang.Class
{0x000000060c9e75d8} - klass: 'java/lang/Class'
- ---- fields (total size 16 words):
- private volatile transient 'classRedefinedCount' 'I' @12 0 (0x00000000)
- injected 'klass' 'J' @16 125249903401936 (0x000071ea04001bd0)
- injected 'array_klass' 'J' @24 0 (0x0000000000000000)
- injected 'oop_size' 'I' @32 16 (0x00000010)
- injected 'static_oop_field_count' 'I' @36 2 (0x00000002)
- private volatile transient 'cachedConstructor' 'Ljava/lang/reflect/Constructor;' @40 null (0x00000000)
- private transient 'name' 'Ljava/lang/String;' @44 "org.jpy.PyLib"{0x000000060c9e7658} (0xc193cecb)
- private transient 'module' 'Ljava/lang/Module;' @48 a 'java/lang/Module'{0x000000060c46b108} (0xc188d621)
- private final 'classLoader' 'Ljava/lang/ClassLoader;' @52 a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x000000060c454120} (0xc188a824)
- private transient 'classData' 'Ljava/lang/Object;' @56 null (0x00000000)
- private transient 'packageName' 'Ljava/lang/String;' @60 "org.jpy"{0x000000060c46b140} (0xc188d628)
- private final 'componentType' 'Ljava/lang/Class;' @64 null (0x00000000)
- private volatile transient 'reflectionData' 'Ljava/lang/ref/SoftReference;' @68 null (0x00000000)
- private volatile transient 'genericInfo' 'Lsun/reflect/generics/repository/ClassRepository;' @72 null (0x00000000)
- private volatile transient 'enumConstants' '[Ljava/lang/Object;' @76 null (0x00000000)
- private volatile transient 'enumConstantDirectory' 'Ljava/util/Map;' @80 null (0x00000000)
- private volatile transient 'annotationData' 'Ljava/lang/Class$AnnotationData;' @84 null (0x00000000)
- private volatile transient 'annotationType' 'Lsun/reflect/annotation/AnnotationType;' @88 null (0x00000000)
- transient 'classValueMap' 'Ljava/lang/ClassValue$ClassValueMap;' @92 null (0x00000000)
- injected 'protection_domain' 'Ljava/lang/Object;' @96 a 'java/security/ProtectionDomain'{0x000000060c46b170} (0xc188d62e)
- injected 'signers_name' 'Ljava/lang/Object;' @100 null (0x00000000)
- injected 'source_file' 'Ljava/lang/Object;' @104 null (0x00000000)
- signature: Lorg/jpy/PyLib;
- ---- static fields (2):
- private static final 'DEBUG' 'Z' @120 false (0x00)
- private static final 'ON_WINDOWS' 'Z' @121 false (0x00)
- private static final 'STOP_IS_NO_OP' 'Z' @122 false (0x00)
- private static 'dllFilePath' 'Ljava/lang/String;' @112 "/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so"{0x000000060c75a1c0} (0xc18eb438)
- private static 'dllProblem' 'Ljava/lang/Throwable;' @116 null (0x00000000)
- private static 'dllLoaded' 'Z' @123 true (0x01)
R12=0x0000647a68ea73d8 points into unknown readable memory: 0x000071ea85ee7160 | 60 71 ee 85 ea 71 00 00
R13=0x0 is null
R14=0x000071ea8616f5f0 points into unknown readable memory: 0x0000000000000001 | 01 00 00 00 00 00 00 00
R15=0x000071ea4c6c56d8 is pointing into the stack for thread: 0x0000647a68ea7020
<end of file>
Edit: It appears that setting `-DPyObject.cleanup_on_thread=false` does prevent this entirely. I will let tests run in the background for a while to confirm.
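For anyone wanting to try the same workaround, here is a minimal sketch of adding that system property to a list of JVM arguments. The helper name `with_cleanup_thread_disabled` is hypothetical, and the commented-out `Server(jvm_args=...)` call is an assumption about how the embedded server is started in your setup, not something confirmed in this thread.

```python
CLEANUP_PROP = "-DPyObject.cleanup_on_thread="


def with_cleanup_thread_disabled(jvm_args=None):
    """Return a copy of jvm_args with PyObject.cleanup_on_thread forced to false.

    Any existing setting of the same property is dropped so the flag
    appears exactly once.
    """
    args = [a for a in (jvm_args or []) if not a.startswith(CLEANUP_PROP)]
    args.append(CLEANUP_PROP + "false")
    return args


# Assumed usage (not verified here):
# from deephaven_server import Server
# Server(jvm_args=with_cleanup_thread_disabled(["-Xmx4g"])).start()
```

The helper only builds the argument list, so it can be checked without starting a JVM.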
We've seen this repeatedly in CI - after the new `py-embedded-server:check` task passes its tests and writes the junit xml report, the python process crashes with a JVM segfault error.
Here are three example crashing frames:
All three of these are reproducible; I've seen each multiple times now while testing locally. None contain any Deephaven code or other recognizable library bytecode that we depend on, except the third, which apparently has jetty objects in registers. With that said, also note that all but the first have truncated stack traces, with the message `error occurred during error reporting (printing target Java thread stack)`.
To test this, I start the docker image that was built for the test, and run this command:
This runs until the test command fails, and logs the time it took until failure was hit.
The bug seems to require a few things to reproduce, which is to say that changing these things stopped the crash from happening on repeated runs:
`jpy.get_type()` or other jpy interactions), but can all be replaced with a single line, such as `import garbage`. Naturally this must be wrapped in try/except to avoid failing the test. This is a simplified version of the `deephaven.dbc.odbc` import, which tries and fails to import `turbodbc.cursor` - no other imports are required.
While removing various parts of the EmbeddedServer and DeephavenApiServer, a new error was discovered:
This is apparently a known CPython issue, where during shutdown another thread can write to stdout/stderr and cause a crash: https://github.com/python/cpython/issues/86883. This is related to EmbeddedServer redirecting Java's System.out and System.err to python's sys.stdout and sys.stderr, as commenting out that redirection prevents this issue: https://github.com/deephaven/deephaven-core/blob/d7034dd28bbefcd1d0515305448ca65e8c72ef5e/py/embedded-server/java-runtime/src/main/java/io/deephaven/python/server/EmbeddedServer.java#L139-L142
It seems possible to mitigate this through the use of `atexit.register()`, calling into Java to close `System.out` and `System.err`. Ideally we might go further here and call jpy's `destroy_jvm()` function, but this hangs indefinitely at this time.
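A rough sketch of that atexit mitigation is below. It is not the actual fix: the function names are hypothetical, and it assumes that the type lookup passed in (for example `jpy.get_type`) returns an object exposing the static `out` and `err` fields of `java.lang.System` as attributes. The lookup is injected so the logic can be exercised without a running JVM.

```python
import atexit


def close_java_streams(get_type):
    """Flush and close Java's System.out and System.err.

    get_type is a callable such as jpy.get_type (assumption: static
    fields are reachable as attributes on the returned type).
    """
    system = get_type("java.lang.System")
    for name in ("out", "err"):
        try:
            stream = getattr(system, name)
            stream.flush()
            stream.close()
        except Exception:
            # Exit handlers should not raise; the interpreter is going away.
            pass


def register_stream_close(get_type, register=atexit.register):
    """Arrange for close_java_streams(get_type) to run at interpreter exit."""
    register(close_java_streams, get_type)
    return close_java_streams


# Assumed usage once the JVM is up:
# import jpy
# register_stream_close(jpy.get_type)
```

Since atexit handlers run in reverse order of registration, registering this after the server starts should mean the streams are closed before handlers that were registered earlier.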