NeuronRobotics / nrjavaserial

A Java Serial Port system. This is a fork of the RXTX project that uses in jar loading of the native code.
Other
344 stars 142 forks source link

Crash in `interruptEventLoop` trying to dereference garbage pointer #251

Open claui opened 4 months ago

claui commented 4 months ago

Summary

While shutting down the JVM in a heavily multithreaded context, we’ve been encountering crashes in interruptEventLoop:

# Problematic frame:
# C  [libNRJavaSerial.so+0x818c]  Java_gnu_io_RXTXPort_interruptEventLoop+0x4c
[…]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000d00000000
Details ```plain # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007f4c14be818c, pid=92950, tid=0x00007f4bd2649700 # # JRE version: OpenJDK Runtime Environment (8.0_342-b07) (build 1.8.0_342-b07) # Java VM: OpenJDK 64-Bit Server VM (25.342-b07 mixed mode linux-amd64 compressed oops) # Problematic frame: # C [libNRJavaSerial.so+0x818c] Java_gnu_io_RXTXPort_interruptEventLoop+0x4c # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please visit: # https://github.com/corretto/corretto-8/issues/ # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # --------------- T H R E A D --------------- Current thread (0x00007f4bf063b800): JavaThread "SpringContextShutdownHook" daemon [_thread_in_native, id=621408, stack(0x00007f4bd2611000,0x00007f4bd264a000)] siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000d00000000 Registers: RAX=0x00000000000000c3, RBX=0x00000000000000c3, RCX=0x0000000000000000, RDX=0x0000000d00000000 RSP=0x00007f4bd2647bd0, RBP=0x00007f4c14dee380, RSI=0x00007f4c14beaad6, RDI=0x00007f4c14beaad6 R8 =0x00007f4c3529d6e0, R9 =0x000000000000000b, R10=0x00007f4c14ffb2bc, R11=0x00007f4c34cb3b80 R12=0x00007f4c14beaf23, R13=0x00007f4c14ffdc80, R14=0x00007f4bd2647c60, R15=0x00007f4bf063b800 RIP=0x00007f4c14be818c, EFLAGS=0x0000000000010206, CSGSFS=0x002b000000000033, ERR=0x0000000000000004 TRAPNO=0x000000000000000e Top of Stack: (sp=0x00007f4bd2647bd0) 0x00007f4bd2647bd0: 00007f4c14ffdc80 00007f4bd2647c40 0x00007f4bd2647be0: 0000000000000000 00007f4c272a01a8 0x00007f4bd2647bf0: 00007f4bd2647c40 00007f4c299421a6 0x00007f4bd2647c00: 00007f4bd2647c00 0000000000000000 0x00007f4bd2647c10: 00007f4bd2647c60 00007f4c14fc5f30 0x00007f4bd2647c20: 0000000000000000 00007f4c14ffdc80 0x00007f4bd2647c30: 0000000000000000 00007f4bd2647c60 0x00007f4bd2647c40: 00007f4bd2647ca8 00007f4c2728b09d 0x00007f4bd2647c50: 0000000000000000 00007f4c27299a78 0x00007f4bd2647c60: 00000000fc6b3f20 00007f4bd2647c68 0x00007f4bd2647c70: 00007f4c14ffe256 00007f4bd2647cc0 0x00007f4bd2647c80: 00007f4c14fc5f30 0000000000000000 0x00007f4bd2647c90: 00007f4c14ffe2d8 00007f4bd2647c60 0x00007f4bd2647ca0: 00007f4bd2647cc0 00007f4bd2647d08 0x00007f4bd2647cb0: 00007f4c2728b09d 0000000000000000 0x00007f4bd2647cc0: 00000000fc6b3f20 00007f4bd2647cc8 0x00007f4bd2647cd0: 00007f4c14ffee03 00007f4bd2647d28 0x00007f4bd2647ce0: 00007f4c14fc5f30 0000000000000000 0x00007f4bd2647cf0: 00007f4c14ffeed8 00007f4bd2647cc0 0x00007f4bd2647d00: 00007f4bd2647d28 00007f4bd2647d70 0x00007f4bd2647d10: 00007f4c2728b09d 0000000000000000 0x00007f4bd2647d20: 0000000000000000 00000000fc6b3f20 0x00007f4bd2647d30: 00007f4bd2647d30 00007f4c15867f52 0x00007f4bd2647d40: 00007f4bd2647d88 00007f4c14fb6bb0 0x00007f4bd2647d50: 0000000000000000 00007f4c14fef028 0x00007f4bd2647d60: 00007f4bd2647d28 00007f4bd2647d88 0x00007f4bd2647d70: 00007f4bd2647dd0 00007f4c2728b09d 0x00007f4bd2647d80: 0000000000000000 00000000fc699b60 0x00007f4bd2647d90: 00007f4bd2647d90 00007f4c153df16c 0x00007f4bd2647da0: 00007f4bd2647de8 00007f4c153df5d0 0x00007f4bd2647db0: 0000000000000000 00007f4c153df1c8 0x00007f4bd2647dc0: 00007f4bd2647d88 00007f4bd2647de8 Instructions: (pc=0x00007f4c14be818c) 0x00007f4c14be816c: 00 00 00 00 48 8b 55 00 48 85 d2 75 13 eb 3d 0f 0x00007f4c14be817c: 1f 44 00 00 48 8b 92 c8 00 00 00 48 85 d2 74 34 0x00007f4c14be818c: 39 1a 75 f0 c7 82 98 00 00 00 01 00 00 00 48 8d 0x00007f4c14be819c: 3d 87 38 00 00 c7 82 68 01 00 00 01 00 00 00 5b Register to memory mapping: RAX=0x00000000000000c3 is an unknown value RBX=0x00000000000000c3 is an unknown value RCX=0x0000000000000000 is an unknown value RDX=0x0000000d00000000 is an unknown value RSP=0x00007f4bd2647bd0 is pointing into the stack for thread: 0x00007f4bf063b800 RBP=0x00007f4c14dee380: master_index+0 in /var/lib/tomcat8/temp/libNRJavaSerial_tomcat8_0/libNRJavaSerial.so at 0x00007f4c14be0000 RSI=0x00007f4c14beaad6: in /var/lib/tomcat8/temp/libNRJavaSerial_tomcat8_0/libNRJavaSerial.so at 0x00007f4c14be0000 RDI=0x00007f4c14beaad6: in /var/lib/tomcat8/temp/libNRJavaSerial_tomcat8_0/libNRJavaSerial.so at 0x00007f4c14be0000 R8 =0x00007f4c3529d6e0: in /usr/lib/jvm/jre1.8/jre/lib/amd64/server/libjvm.so at 0x00007f4c34295000 R9 =0x000000000000000b is an unknown value R10=0x00007f4c14ffb2bc is pointing into metadata R11=0x00007f4c34cb3b80: in /usr/lib/jvm/jre1.8/jre/lib/amd64/server/libjvm.so at 0x00007f4c34295000 R12=0x00007f4c14beaf23: in /var/lib/tomcat8/temp/libNRJavaSerial_tomcat8_0/libNRJavaSerial.so at 0x00007f4c14be0000 R13={method} {0x00007f4c14ffdc80} 'interruptEventLoop' '()V' in 'gnu/io/RXTXPort' R14=0x00007f4bd2647c60 is pointing into the stack for thread: 0x00007f4bf063b800 R15=0x00007f4bf063b800 is a thread Stack: [0x00007f4bd2611000,0x00007f4bd264a000], sp=0x00007f4bd2647bd0, free space=218k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [libNRJavaSerial.so+0x818c] Java_gnu_io_RXTXPort_interruptEventLoop+0x4c Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j gnu.io.RXTXPort.interruptEventLoop()V+0 j gnu.io.RXTXPort.removeEventListener()V+62 j gnu.io.RXTXPort.close()V+99 j gnu.io.NRSerialPort.disconnect()V+18 j ***redacted*** j ***redacted*** j ***redacted*** v ~StubRoutines::call_stub J 1834 sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (0 bytes) @ 0x00007f4c2795f2f7 [0x00007f4c2795f280+0x77] j sun.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+100 J 11853 C2 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (62 bytes) @ 0x00007f4c2955c1d4 [0x00007f4c2955c120+0xb4] j org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(Ljava/lang/Object;)V+16 j org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeDestroyMethods(Ljava/lang/Object;Ljava/lang/String;)V+122 j org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeDestruction(Ljava/lang/Object;Ljava/lang/String;)V+12 j org.springframework.beans.factory.support.DisposableBeanAdapter.destroy()V+48 j org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(Ljava/lang/String;Lorg/springframework/beans/factory/DisposableBean;)V+133 j org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(Ljava/lang/String;)V+41 j org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(Ljava/lang/String;)V+2 j org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons()V+105 j org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons()V+1 j org.springframework.context.support.AbstractApplicationContext.destroyBeans()V+4 j org.springframework.context.support.AbstractApplicationContext.doClose()V+133 j org.springframework.context.support.AbstractApplicationContext$1.run()V+14 v ~StubRoutines::call_stub --------------- P R O C E S S --------------- Java Threads: ( => current thread ) =>0x00007f4bf063b800 JavaThread "SpringContextShutdownHook" daemon [_thread_in_native, id=621408, stack(0x00007f4bd2611000,0x00007f4bd264a000)] 0x00007f4c04026800 JavaThread "SIGTERM handler" daemon [_thread_blocked, id=621404, stack(0x00007f4bd445b000,0x00007f4bd4494000)] 0x00007f4c18294800 JavaThread "RXTXPortMonitor(/dev/ttyS0)" daemon [_thread_in_native, id=128537, stack(0x00007f4bd44cd000,0x00007f4bd4506000)] [241 more Java threads omitted] [details omitted] --------------- S Y S T E M --------------- OS:PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/" uname:Linux 5.10.0-27-amd64 #1 SMP Debian 5.10.205-2 (2023-12-31) x86_64 libc:glibc 2.31 NPTL 2.31 rlimit: STACK 8192k, CORE 0k, NPROC 6141, NOFILE 524288, AS infinity load average:0,42 0,54 0,45 [details omitted] Memory: 4k page, physical 1607956k(77304k free), swap 2440188k(1500068k free) vm_info: OpenJDK 64-Bit Server VM (25.342-b07) for linux-amd64 JRE (1.8.0_342-b07), built on Jul 27 2022 21:29:28 by "jenkins" with gcc 7.3.1 20180303 (Red Hat 7.3.1-5) time: Thu Feb 8 09:24:33 2024 timezone: CET elapsed time: 57181.688148 seconds (0d 15h 53m 1s) ```

Observations

Note that the address being dereferenced doesn’t even remotely resemble a memory location nor a near-zero value, but instead looks more like unrelated, non-address data:

si_addr: 0x0000000d00000000

The instruction pointer at the time of the crash:

Instructions: (pc=0x00007f4c14be818c)
0x00007f4c14be816c:   00 00 00 00 48 8b 55 00 48 85 d2 75 13 eb 3d 0f
0x00007f4c14be817c:   1f 44 00 00 48 8b 92 c8 00 00 00 48 85 d2 74 34
0x00007f4c14be818c:   39 1a 75 f0 c7 82 98 00 00 00 01 00 00 00 48 8d
0x00007f4c14be819c:   3d 87 38 00 00 c7 82 68 01 00 00 01 00 00 00 5b 

points to this piece of code (obtained via objdump -d):

Disassembly of section .text:

0000000000008140 <Java_gnu_io_RXTXPort_interruptEventLoop>:
    […]
    8187:   48 85 d2                test   %rdx,%rdx
    818a:   74 34                   je     81c0 <Java_gnu_io_RXTXPort_interruptEventLoop+0x80>
 => 818c:   39 1a                   cmp    %ebx,(%rdx)
    818e:   75 f0                   jne    8180 <Java_gnu_io_RXTXPort_interruptEventLoop+0x40>
    8190:   c7 82 98 00 00 00 01    movl   $0x1,0x98(%rdx)
    […]
Details ```asm 0000000000008140 : 8140: 41 54 push %r12 8142: 48 8d 0d ce 2a 00 00 lea 0x2ace(%rip),%rcx # ac17 <_fini+0x18b> 8149: 48 8d 15 84 29 00 00 lea 0x2984(%rip),%rdx # aad4 <_fini+0x48> 8150: 55 push %rbp 8151: 4c 8d 25 cb 2d 00 00 lea 0x2dcb(%rip),%r12 # af23 <_fini+0x497> 8158: 53 push %rbx 8159: e8 42 b8 ff ff callq 39a0 815e: 48 8b 2d 8b 5e 20 00 mov 0x205e8b(%rip),%rbp # 20dff0 8165: 89 c3 mov %eax,%ebx 8167: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 816e: 00 00 8170: 48 8b 55 00 mov 0x0(%rbp),%rdx 8174: 48 85 d2 test %rdx,%rdx 8177: 75 13 jne 818c 8179: eb 3d jmp 81b8 817b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 8180: 48 8b 92 c8 00 00 00 mov 0xc8(%rdx),%rdx 8187: 48 85 d2 test %rdx,%rdx 818a: 74 34 je 81c0 818c: 39 1a cmp %ebx,(%rdx) 818e: 75 f0 jne 8180 8190: c7 82 98 00 00 00 01 movl $0x1,0x98(%rdx) 8197: 00 00 00 819a: 48 8d 3d 87 38 00 00 lea 0x3887(%rip),%rdi # ba28 <_fini+0xf9c> 81a1: c7 82 68 01 00 00 01 movl $0x1,0x168(%rdx) 81a8: 00 00 00 81ab: 5b pop %rbx 81ac: 5d pop %rbp 81ad: 41 5c pop %r12 81af: e9 cc b8 ff ff jmpq 3a80 81b4: 0f 1f 40 00 nopl 0x0(%rax) 81b8: 4c 89 e7 mov %r12,%rdi 81bb: e8 c0 b8 ff ff callq 3a80 81c0: 48 8d 3d 5e 2d 00 00 lea 0x2d5e(%rip),%rdi # af25 <_fini+0x499> 81c7: e8 b4 b8 ff ff callq 3a80 81cc: bf e8 03 00 00 mov $0x3e8,%edi 81d1: e8 2a b9 ff ff callq 3b00 81d6: eb 98 jmp 8170 81d8: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 81df: 00 ```

Which in turn matches the index->fd != fd expression here:

https://github.com/NeuronRobotics/nrjavaserial/blob/0df8b60485a56d7698b71183237b5615d02a8194/src/main/c/src/SerialImp.c#L4935-L4936

In other words, while navigating the chain of event_info_structs, the code in interruptEventLoop appears to be bumping into a garbage next pointer and tries to dereference it, causing the crash.

Possibly related to issue #197, which is reporting an infinite loop rather than a crash.

Possible explanation

If I’m reading the source code correctly, each event_info_struct is owned by its monitor thread, and lives in the stack frame of that thread’s eventLoop.

I couldn’t ultimately figure out how garbage could be possibly ending up in the middle of an event_info_struct.

But a possible explanation would be: a monitor thread might be dying before it gets to bail out of its event_loop. Or maybe a TERM or some other signal is beaming the thread out of its event_loop. Either way, the event_loop never gets to run finalize_event_info_struct. This causes a stale pointer to remain in the global chain, even after the stack frame of event_loop is unrolled. The memory might later be reused by another stack frame or stack.

This theory could explain why during a subsequent invocation of interruptEventLoop in some other thread, while navigating the global chain of event_info_structs, the code might be taking some piece of stack garbage as the real thing.

All that is just speculation from static analysis of the code. But if my theory happens to be correct, it might also explain the stalling reported in issue #197; if the stack segment of a reaped monitor thread is immediately reused by another monitor thread, the stack frame address of the new event_loop might perfectly align with the stale stack frame from the past thread’s event_loop, which means its event_info_struct pointer might end up twice in the global chain. This would cause a cycle in the chain, and a subsequent call to interruptEventLoop from a Java thread might become stuck in an infinite cycle.

Fix

Preliminary testing suggests that PR #211 might be fixing this.