deephaven / deephaven-core

Deephaven Community Core
Other
249 stars 79 forks source link

Python deephaven-server segfault on shutdown #6005

Closed niloc132 closed 1 week ago

niloc132 commented 2 weeks ago

We've seen this repeatedly in CI - after the new py-embedded-server:check task passes its tests and writes the junit xml report, the python process crashes with a JVM segfault error.

Here are three example crashing frames

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff9d8d6f5ed, pid=1828, tid=1835
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.4+7 (21.0.4+7) (build 21.0.4+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (21.0.4+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x56f5ed]  frame::sender(RegisterMap*) const+0x1ed
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /project/core.1828)
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  S U M M A R Y ------------

Command Line: -Djpy.jpyLib=/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so -Djpy.jdlLib=/opt/deephaven/venv/lib/python3.10/site-packages/jdl.cpython-310-x86_64-linux-gnu.so -Djpy.pythonLib=/usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so -Djpy.pythonPrefix=/opt/deephaven/venv -Djpy.pythonExecutable=/opt/deephaven/venv/bin/python -DPythonDeephavenSession.initScripts= -DLoggerFactory.silenceOnProcessEnvironment=true -Dstdout.toLogBuffer=false -Dstderr.toLogBuffer=false -Dlogback.configurationFile=logback-minimal.xml -Xrs -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=/opt/deephaven/venv/lib/python3.10/site-packages/deephaven_server/jars/dh-compiler-directives.txt -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+UseStringDeduplication -XX:GCLockerRetryAllocationCount=128 --add-opens=java.base/java.nio=ALL-UNNAMED --add-exports=java.management/sun.management=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED 

Host: AMD Ryzen Threadripper 1950X 16-Core Processor, 32 cores, 31G, Ubuntu 22.04.4 LTS
Time: Mon Aug 19 16:16:44 2024 UTC elapsed time: 4.994699 seconds (0d 0h 0m 4s)

---------------  T H R E A D  ---------------

Current thread (0x0000640737e61c20):  VMThread "VM Thread"          [id=1835, stack(0x00007ff9ad8fa000,0x00007ff9ad9fa000) (1024K)] _threads_hazard_ptr=0x00007ff8ec230b20

Stack: [0x00007ff9ad8fa000,0x00007ff9ad9fa000],  sp=0x00007ff9ad9f7380,  free space=1012k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x56f5ed]  frame::sender(RegisterMap*) const+0x1ed
V  [libjvm.so+0x928369]  JavaThread::deoptimize_marked_methods()+0xb9
V  [libjvm.so+0x89d156]  HandshakeOperation::do_handshake(JavaThread*)+0x46
V  [libjvm.so+0x89d546]  HandshakeState::try_process(HandshakeOperation*) [clone .part.0]+0x106
V  [libjvm.so+0x89ed00]  VM_HandshakeAllThreads::doit()+0x340
V  [libjvm.so+0x100ce57]  VM_Operation::evaluate()+0x107
V  [libjvm.so+0x10102af]  VMThread::evaluate_operation(VM_Operation*)+0x13f
V  [libjvm.so+0x1010edd]  VMThread::inner_execute(VM_Operation*)+0x1fd
V  [libjvm.so+0x1011337]  VMThread::run()+0xb7
V  [libjvm.so+0xf79b58]  Thread::call_run()+0xa8
V  [libjvm.so+0xd048ea]  thread_native_entry(Thread*)+0xda

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000011

Registers:
RAX=0x0000000000000009, RBX=0x00007ff9ad9f7448, RCX=0x00007ff9c841eb00, RDX=0x0000640737e42b90
RSP=0x00007ff9ad9f7380, RBP=0x00007ff9ad9f73c0, RSI=0x00007ff9ad9f7410, RDI=0x0000000000000000
R8 =0x00007ff9c841ea90, R9 =0x00007ff9c841ea80, R10=0x00007ff9d9d97000, R11=0x442071bb79d20cd4
R12=0x00007ff9ad9f7410, R13=0x00007ff9ad9f73d0, R14=0x00007ff9ad9f73d0, R15=0x00007ff9440191e0
RIP=0x00007ff9d8d6f5ed, EFLAGS=0x0000000000010283, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Register to memory mapping:

RAX=0x0000000000000009 is an unknown value
RBX=0x00007ff9ad9f7448 points into unknown readable memory: 0x00007ff9ad9f76b0 | b0 76 9f ad f9 7f 00 00
RCX=0x00007ff9c841eb00 is pointing into interpreter code (not bytecode specific)
RDX=0x0000640737e42b90 points into unknown readable memory: 0x0000640737be4480 | 80 44 be 37 07 64 00 00
RSP=0x00007ff9ad9f7380 points into unknown readable memory: 0x00007ff9c841ea90 | 90 ea 41 c8 f9 7f 00 00
RBP=0x00007ff9ad9f73c0 points into unknown readable memory: 0x00007ff9ad9f86f0 | f0 86 9f ad f9 7f 00 00
RSI=0x00007ff9ad9f7410 points into unknown readable memory: 0x00007ff8b93f2280 | 80 22 3f b9 f8 7f 00 00
RDI=0x0 is null
R8 =0x00007ff9c841ea90 is at code_begin+-112 in 
[CodeBlob (0x00007ff9c841ea90)]
Framesize: 0
BufferBlob (0x00007ff9c841ea90) used for Interpreter
R9 =0x00007ff9c841ea80 points into unknown readable memory: 0x0000000000000301 | 01 03 00 00 00 00 00 00
R10=0x00007ff9d9d97000 points into unknown readable memory: 0x0100050403020100 | 00 01 02 03 04 05 00 01
R11=0x442071bb79d20cd4 is an unknown value
R12=0x00007ff9ad9f7410 points into unknown readable memory: 0x00007ff8b93f2280 | 80 22 3f b9 f8 7f 00 00
R13=0x00007ff9ad9f73d0 points into unknown readable memory: 0x00007ff8b93f2280 | 80 22 3f b9 f8 7f 00 00
R14=0x00007ff9ad9f73d0 points into unknown readable memory: 0x00007ff8b93f2280 | 80 22 3f b9 f8 7f 00 00
R15=0x00007ff9440191e0 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000073e4024384e0, pid=11431, tid=11461
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.4+7 (21.0.4+7) (build 21.0.4+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (21.0.4+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x8384e0]  G1ParCopyClosure<(G1Barrier)0, false>::do_oop(oopDesc**)+0x0
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /project/core.11431)
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  S U M M A R Y ------------

Command Line: -Djpy.jpyLib=/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so -Djpy.jdlLib=/opt/deephaven/venv/lib/python3.10/site-packages/jdl.cpython-310-x86_64-linux-gnu.so -Djpy.pythonLib=/usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so -Djpy.pythonPrefix=/opt/deephaven/venv -Djpy.pythonExecutable=/opt/deephaven/venv/bin/python -DPythonDeephavenSession.initScripts= -DLoggerFactory.silenceOnProcessEnvironment=true -Dstdout.toLogBuffer=false -Dstderr.toLogBuffer=false -Dlogback.configurationFile=logback-minimal.xml -Xrs -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=/opt/deephaven/venv/lib/python3.10/site-packages/deephaven_server/jars/dh-compiler-directives.txt -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+UseStringDeduplication -XX:GCLockerRetryAllocationCount=128 --add-opens=java.base/java.nio=ALL-UNNAMED --add-exports=java.management/sun.management=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED 

Host: AMD Ryzen Threadripper 1950X 16-Core Processor, 32 cores, 31G, Ubuntu 22.04.4 LTS
Time: Mon Aug 19 16:45:00 2024 UTC elapsed time: 4.604654 seconds (0d 0h 0m 4s)

---------------  T H R E A D  ---------------

Current thread (0x000073e36c00ead0):  WorkerThread "GC Thread#6"    [id=11461, stack(0x000073e3d4efa000,0x000073e3d4ffa000) (1024K)]

Stack: [0x000073e3d4efa000,0x000073e3d4ffa000],  sp=0x000073e3d4ff75f8,  free space=1013k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x8384e0]  G1ParCopyClosure<(G1Barrier)0, false>::do_oop(oopDesc**)+0x0
V  [libjvm.so+0x926043]  JavaThread::oops_do_frames(OopClosure*, CodeBlobClosure*) [clone .part.0]+0xd3
V  [libjvm.so+0xf79625]  Thread::oops_do(OopClosure*, CodeBlobClosure*)+0x95
V  [libjvm.so+0xf85ba2]  Threads::possibly_parallel_oops_do(bool, OopClosure*, CodeBlobClosure*)+0x102
V  [libjvm.so+0x83ab72]  G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x72
V  [libjvm.so+0x83b251]  G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x61
V  [libjvm.so+0x84f302]  G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x22
V  [libjvm.so+0x84f569]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x89
V  [libjvm.so+0x1037140]  WorkerThread::run()+0x80
V  [libjvm.so+0xf79b58]  Thread::call_run()+0xa8
V  [libjvm.so+0xd048ea]  thread_native_entry(Thread*)+0xda
JavaThread 0x00005c37c3124e00 (nid = 11570) was being processed
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  jdk.internal.misc.Unsafe.park(ZJ)V+-1650398176 java.base@21.0.4
C  0x000073e188023be0
C  0x000073e31d5fe360
C  0x0000000000013b50

[error occurred during error reporting (printing target Java thread stack), id 0xb, SIGSEGV (0xb) at pc=0x000073e40225cff4]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008

Registers:
RAX=0x000073e40301c0f8, RBX=0x0000000000000010, RCX=0x000073e36c00ead0, RDX=0x000073e31d5fe290
RSP=0x000073e3d4ff75f8, RBP=0x000073e3d4ff76e0, RSI=0x0000000000000008, RDI=0x000073e2e4003748
R8 =0x0000000000000000, R9 =0x000073e3d4ff7778, R10=0x000073e404030000, R11=0x0000000000000000
R12=0x000073e2e4003748, R13=0x000073e3d4ff7740, R14=0x000000009da0ec20, R15=0x000073e3d4ff7778
RIP=0x000073e4024384e0, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Register to memory mapping:

RAX=0x000073e40301c0f8: <offset 0x000000000141c0f8> in /opt/java/openjdk/lib/server/libjvm.so at 0x000073e401c00000
RBX=0x0000000000000010 is an unknown value
RCX=0x000073e36c00ead0 points into unknown readable memory: 0x000073e4030dcc00 | 00 cc 0d 03 e4 73 00 00
RDX=0x000073e31d5fe290 is pointing into the stack for thread: 0x00005c37c3124e00
RSP=0x000073e3d4ff75f8 points into unknown readable memory: 0x000073e40239fb11 | 11 fb 39 02 e4 73 00 00
RBP=0x000073e3d4ff76e0 points into unknown readable memory: 0x000073e3d4ff8a30 | 30 8a ff d4 e3 73 00 00
RSI=0x0000000000000008 is an unknown value
RDI=0x000073e2e4003748 points into unknown readable memory: 0x000073e40301c0f8 | f8 c0 01 03 e4 73 00 00
R8 =0x0 is null
R9 =0x000073e3d4ff7778 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
R10=0x000073e404030000 points into unknown readable memory: 0x0100050403020100 | 00 01 02 03 04 05 00 01
R11=0x0 is null
R12=0x000073e2e4003748 points into unknown readable memory: 0x000073e40301c0f8 | f8 c0 01 03 e4 73 00 00
R13=0x000073e3d4ff7740 points into unknown readable memory: 0x000073e31d5fe230 | 30 e2 5f 1d e3 73 00 00
R14=0x000000009da0ec20 is an unknown value
R15=0x000073e3d4ff7778 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000072de1ea84e63, pid=2197, tid=2267
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.4+7 (21.0.4+7) (build 21.0.4+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (21.0.4+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xc84e63]  Method::bcp_from(unsigned char*) const+0x3
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /project/core.2197)
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  S U M M A R Y ------------

Command Line: -Djpy.jpyLib=/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so -Djpy.jdlLib=/opt/deephaven/venv/lib/python3.10/site-packages/jdl.cpython-310-x86_64-linux-gnu.so -Djpy.pythonLib=/usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so -Djpy.pythonPrefix=/opt/deephaven/venv -Djpy.pythonExecutable=/opt/deephaven/venv/bin/python -DPythonDeephavenSession.initScripts= -DLoggerFactory.silenceOnProcessEnvironment=true -Dstdout.toLogBuffer=false -Dstderr.toLogBuffer=false -Dlogback.configurationFile=logback-minimal.xml -Xrs -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=/opt/deephaven/venv/lib/python3.10/site-packages/deephaven_server/jars/dh-compiler-directives.txt -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+UseStringDeduplication -XX:GCLockerRetryAllocationCount=128 -DPyObject.cleanup_on_thread=false --add-opens=java.base/java.nio=ALL-UNNAMED --add-exports=java.management/sun.management=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED 

Host: AMD Ryzen Threadripper 1950X 16-Core Processor, 32 cores, 31G, Ubuntu 22.04.4 LTS
Time: Tue Aug 27 21:37:11 2024 UTC elapsed time: 4.490472 seconds (0d 0h 0m 4s)

---------------  T H R E A D  ---------------

Current thread (0x000072dd8801a2f0):  WorkerThread "GC Thread#13"   [id=2267, stack(0x000072dde68ea000,0x000072dde69ea000) (1024K)]

Stack: [0x000072dde68ea000,0x000072dde69ea000],  sp=0x000072dde69e75f8,  free space=1013k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xc84e63]  Method::bcp_from(unsigned char*) const+0x3
V  [libjvm.so+0x926043]  JavaThread::oops_do_frames(OopClosure*, CodeBlobClosure*) [clone .part.0]+0xd3
V  [libjvm.so+0xf79625]  Thread::oops_do(OopClosure*, CodeBlobClosure*)+0x95
V  [libjvm.so+0xf85ba2]  Threads::possibly_parallel_oops_do(bool, OopClosure*, CodeBlobClosure*)+0x102
V  [libjvm.so+0x83ab72]  G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x72
V  [libjvm.so+0x83b251]  G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x61
V  [libjvm.so+0x84f302]  G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x22
V  [libjvm.so+0x84f569]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x89
V  [libjvm.so+0x1037140]  WorkerThread::run()+0x80
V  [libjvm.so+0xf79b58]  Thread::call_run()+0xa8
V  [libjvm.so+0xd048ea]  thread_native_entry(Thread*)+0xda
JavaThread 0x00005a69138b6b70 (nid = 2324) was being processed
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  0x000072de0da23991

[error occurred during error reporting (printing target Java thread stack), id 0xb, SIGSEGV (0xb) at pc=0x000072de1e5a08d5]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000029

Registers:
RAX=0x000072dd9156abb0, RBX=0x0000000000000000, RCX=0x000072dd8801a2f0, RDX=0x000072de203ae4a0
RSP=0x000072dde69e75f8, RBP=0x000072dde69e76e0, RSI=0x000072dd9156abb0, RDI=0x0000000000000000
R8 =0x0000000000000000, R9 =0x000072dde69e7778, R10=0x000072de1f472000, R11=0x0000000000000000
R12=0x000072dbd8001388, R13=0x000072dde69e7740, R14=0x000072dde69e7740, R15=0x000072dde69e7778
RIP=0x000072de1ea84e63, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Register to memory mapping:

RAX={method} {0x000072dd9156abb0} 'newConnection' '(Lorg/eclipse/jetty/server/Connector;Lorg/eclipse/jetty/io/EndPoint;)Lorg/eclipse/jetty/io/Connection;' in 'org/eclipse/jetty/http2/server/AbstractHTTP2ServerConnectionFactory'
RBX=0x0 is null
RCX=0x000072dd8801a2f0 points into unknown readable memory: 0x000072de1f2dcc00 | 00 cc 2d 1f de 72 00 00
RDX=0x000072de203ae4a0 points into unknown readable memory: 0x0000000000000040 | 40 00 00 00 00 00 00 00
RSP=0x000072dde69e75f8 points into unknown readable memory: 0x000072de1e59fad6 | d6 fa 59 1e de 72 00 00
RBP=0x000072dde69e76e0 points into unknown readable memory: 0x000072dde69e8a30 | 30 8a 9e e6 dd 72 00 00
RSI={method} {0x000072dd9156abb0} 'newConnection' '(Lorg/eclipse/jetty/server/Connector;Lorg/eclipse/jetty/io/EndPoint;)Lorg/eclipse/jetty/io/Connection;' in 'org/eclipse/jetty/http2/server/AbstractHTTP2ServerConnectionFactory'
RDI=0x0 is null
R8 =0x0 is null
R9 =0x000072dde69e7778 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
R10=0x000072de1f472000 points into unknown readable memory: 0x0100050403020100 | 00 01 02 03 04 05 00 01
R11=0x0 is null
R12=0x000072dbd8001388 points into unknown readable memory: 0x000072de1f21c0f8 | f8 c0 21 1f de 72 00 00
R13=0x000072dde69e7740 points into unknown readable memory: 0x000072dd395fe230 | 30 e2 5f 39 dd 72 00 00
R14=0x000072dde69e7740 points into unknown readable memory: 0x000072dd395fe230 | 30 e2 5f 39 dd 72 00 00
R15=0x000072dde69e7778 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00

All three of these are reproducible, I've seen each multiple times now while testing locally. None contain any Deephaven code or other recognizable library bytecode that we depend on, except the third, which apparently has jetty objects in registers. With that said, also note that all but the first have truncated stack traces, with the message error occurred during error reporting (printing target Java thread stack).


To test this, I start the docker image that was built for the test, and run this command:

time while python -m xmlrunner discover tests; do :; done

This runs until the test command fails, and logs the time it took until failure was hit.

The bug seems to require a few things to reproduce, which is to say that changing these things stopped the crash from happening on repeated runs:


While removing various parts of the EmbeddedServer and DeephavenApiServer, a new error was discovered:

Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x00005743dcc56960)

Current thread 0x00007ddc26999000 (most recent call first):
  <no Python frame>

Extension modules: jpy, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, pyarrow.lib, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box (total: 62)
Aborted (core dumped)

This is apparently a known CPython issue, where during shutdown another thread can write to stdout/stderr and cause a crash: https://github.com/python/cpython/issues/86883. This is related to EmbeddedServer redirecting Java's System.out and System.err to python's sys.out and sys.err, as commenting out that redirection prevents this issue: https://github.com/deephaven/deephaven-core/blob/d7034dd28bbefcd1d0515305448ca65e8c72ef5e/py/embedded-server/java-runtime/src/main/java/io/deephaven/python/server/EmbeddedServer.java#L139-L142

It seems possible to mitigate this through the use of atexit.register(), calling into Java to close System.out and System.err. Ideally we might go further here and call jpy's destroy_jvm() function, but this hangs indefinitely at this time.

niloc132 commented 2 weeks ago

It looks like invoking the DH ShutdownManager in the atexit hook largely resolves this - the process no longer (in 2 hrs of consecutive 3-4s runs) exits with a code other than zero... but it does very intermittently encounter a SIGSEGV inside of Java. This is partially written to disk before the python process ends, so the bad exit code never happens (and the rest of the dump isn't completed either).

This new error seems to be that most of Python is shut down but the cleanup thread is still running. It would be nice if we could trigger this thread shutdown directly, either from python or from our own Java shutdown code.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000647a5cbc7480, pid=122266, tid=122351
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.4+7 (21.0.4+7) (build 21.0.4+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (21.0.4+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [python+0x230480]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /project/core.122266)
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -Djpy.jpyLib=/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so -Djpy.jdlLib=/opt/deephaven/venv/lib/python3.10/site-packages/jdl.cpython-310-x86_64-linux-gnu.so -Djpy.pythonLib=/usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so -Djpy.pythonPrefix=/opt/deephaven/venv -Djpy.pythonExecutable=/opt/deephaven/venv/bin/python -DPythonDeephavenSession.initScripts= -DLoggerFactory.silenceOnProcessEnvironment=true -Dstdout.toLogBuffer=false -Dstderr.toLogBuffer=false -Dlogback.configurationFile=logback-minimal.xml -Xrs -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=/opt/deephaven/venv/lib/python3.10/site-packages/deephaven_server/jars/dh-compiler-directives.txt -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+UseStringDeduplication -XX:GCLockerRetryAllocationCount=128 --add-opens=java.base/java.nio=ALL-UNNAMED --add-exports=java.management/sun.management=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED 

Host: AMD Ryzen Threadripper 1950X 16-Core Processor, 32 cores, 31G, Ubuntu 22.04.4 LTS
Time: Fri Aug 30 15:34:45 2024 UTC elapsed time: 3.374934 seconds (0d 0h 0m 3s)

---------------  T H R E A D  ---------------

Current thread (0x0000647a68ea7020):  JavaThread "PyObject-cleanup" daemon [_thread_in_native, id=122351, stack(0x000071ea4c5c7000,0x000071ea4c6c7000) (1024K)]

Stack: [0x000071ea4c5c7000,0x000071ea4c6c7000],  sp=0x000071ea4c6c5580,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [python+0x230480]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.jpy.PyLib.callAndReturnValue(JZLjava/lang/String;I[Ljava/lang/Object;[Ljava/lang/Class;Ljava/lang/Class;)Ljava/lang/Object;+0
j  org.jpy.PyProxyHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+291
j  org.jpy.$Proxy12.cleanupOnlyUseFromGIL()I+9
j  org.jpy.PyObjectReferences.cleanupThreadLogic()V+15
j  org.jpy.PyObjectReferences$$Lambda+0x000071ea043b8c80.run()V+4
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21.0.4
j  java.lang.Thread.run()V+19 java.base@21.0.4
v  ~StubRoutines::call_stub 0x000071ea74617cc6

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000010

Registers:
RAX=0x0000000000000000, RBX=0x0000000000000001, RCX=0x0000000000000001, RDX=0x000071ea8616f5f0
RSP=0x000071ea4c6c5580, RBP=0x0000647a5cf38ae0, RSI=0x0000000000000001, RDI=0x0000000000000000
R8 =0x0000000000000000, R9 =0x0000000000000000, R10=0x000071ea74623991, R11=0x000000060c9e75d8
R12=0x0000647a68ea73d8, R13=0x0000000000000000, R14=0x000071ea8616f5f0, R15=0x000071ea4c6c56d8
RIP=0x0000647a5cbc7480, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Register to memory mapping:

RAX=0x0 is null
RBX=0x0000000000000001 is an unknown value
RCX=0x0000000000000001 is an unknown value
RDX=0x000071ea8616f5f0 points into unknown readable memory: 0x0000000000000001 | 01 00 00 00 00 00 00 00
RSP=0x000071ea4c6c5580 is pointing into the stack for thread: 0x0000647a68ea7020
RBP=0x0000647a5cf38ae0: _PyRuntime+0x0000000000000000 in python at 0x0000647a5c997000
RSI=0x0000000000000001 is an unknown value
RDI=0x0 is null
R8 =0x0 is null
R9 =0x0 is null
R10=0x000071ea74623991 is at code_begin+1009 in an Interpreter codelet
native method entry point (kind = native)  [0x000071ea746235a0, 0x000071ea74624010]  2672 bytes
R11=0x000000060c9e75d8 is an oop: java.lang.Class 
{0x000000060c9e75d8} - klass: 'java/lang/Class'
 - ---- fields (total size 16 words):
 - private volatile transient 'classRedefinedCount' 'I' @12  0 (0x00000000)
 - injected 'klass' 'J' @16  125249903401936 (0x000071ea04001bd0)
 - injected 'array_klass' 'J' @24  0 (0x0000000000000000)
 - injected 'oop_size' 'I' @32  16 (0x00000010)
 - injected 'static_oop_field_count' 'I' @36  2 (0x00000002)
 - private volatile transient 'cachedConstructor' 'Ljava/lang/reflect/Constructor;' @40  null (0x00000000)
 - private transient 'name' 'Ljava/lang/String;' @44  "org.jpy.PyLib"{0x000000060c9e7658} (0xc193cecb)
 - private transient 'module' 'Ljava/lang/Module;' @48  a 'java/lang/Module'{0x000000060c46b108} (0xc188d621)
 - private final 'classLoader' 'Ljava/lang/ClassLoader;' @52  a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x000000060c454120} (0xc188a824)
 - private transient 'classData' 'Ljava/lang/Object;' @56  null (0x00000000)
 - private transient 'packageName' 'Ljava/lang/String;' @60  "org.jpy"{0x000000060c46b140} (0xc188d628)
 - private final 'componentType' 'Ljava/lang/Class;' @64  null (0x00000000)
 - private volatile transient 'reflectionData' 'Ljava/lang/ref/SoftReference;' @68  null (0x00000000)
 - private volatile transient 'genericInfo' 'Lsun/reflect/generics/repository/ClassRepository;' @72  null (0x00000000)
 - private volatile transient 'enumConstants' '[Ljava/lang/Object;' @76  null (0x00000000)
 - private volatile transient 'enumConstantDirectory' 'Ljava/util/Map;' @80  null (0x00000000)
 - private volatile transient 'annotationData' 'Ljava/lang/Class$AnnotationData;' @84  null (0x00000000)
 - private volatile transient 'annotationType' 'Lsun/reflect/annotation/AnnotationType;' @88  null (0x00000000)
 - transient 'classValueMap' 'Ljava/lang/ClassValue$ClassValueMap;' @92  null (0x00000000)
 - injected 'protection_domain' 'Ljava/lang/Object;' @96  a 'java/security/ProtectionDomain'{0x000000060c46b170} (0xc188d62e)
 - injected 'signers_name' 'Ljava/lang/Object;' @100  null (0x00000000)
 - injected 'source_file' 'Ljava/lang/Object;' @104  null (0x00000000)
 - signature: Lorg/jpy/PyLib;
 - ---- static fields (2):
 - private static final 'DEBUG' 'Z' @120  false (0x00)
 - private static final 'ON_WINDOWS' 'Z' @121  false (0x00)
 - private static final 'STOP_IS_NO_OP' 'Z' @122  false (0x00)
 - private static 'dllFilePath' 'Ljava/lang/String;' @112  "/opt/deephaven/venv/lib/python3.10/site-packages/jpy.cpython-310-x86_64-linux-gnu.so"{0x000000060c75a1c0} (0xc18eb438)
 - private static 'dllProblem' 'Ljava/lang/Throwable;' @116  null (0x00000000)
 - private static 'dllLoaded' 'Z' @123  true (0x01)
R12=0x0000647a68ea73d8 points into unknown readable memory: 0x000071ea85ee7160 | 60 71 ee 85 ea 71 00 00
R13=0x0 is null
R14=0x000071ea8616f5f0 points into unknown readable memory: 0x0000000000000001 | 01 00 00 00 00 00 00 00
R15=0x000071ea4c6c56d8 is pointing into the stack for thread: 0x0000647a68ea7020
<end of file>

Edit: It appears that setting -DPyObject.cleanup_on_thread=false does prevent this entirely, will let tests run in the background for a while to confirm