Open lhotari opened 1 year ago
The issue had no activity for 30 days, mark with Stale label.
I found this problem in the test environment. Pulsar version: 2.11.1 JDK version: 17.0.6 Perhaps it is related to this problem( https://bugs.openjdk.org/browse/jdk-8285835), so upgrade JDK to 17.0.7. There is no crash for the time being.
The issue had no activity for 30 days, mark with Stale label.
Closed according to https://github.com/apache/pulsar/issues/19307#issuecomment-1541958093. If it comes back, feel free to open a new issue.
Hi, I was able to reproduce this problem on the latest Pulsar 3.0.0 image and would like to re-open this issue.
--set defaultPulsarImageTag=3.0.0
.Messages produced with success. Brokers do not restart.
After around 30 000 messages, Brokers keep restarting with the following error:
INFO 2023-07-10T08:38:13.117316491Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.117589437Z [resource.labels.containerName: pulsar-broker] # A fatal error has been detected by the Java Runtime Environment:
INFO 2023-07-10T08:38:13.118142215Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.119489360Z [resource.labels.containerName: pulsar-broker] # SIGSEGV (0xb) at pc=0x00007f9aab88e0f3, pid=1, tid=202
INFO 2023-07-10T08:38:13.119805559Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.120234080Z [resource.labels.containerName: pulsar-broker] # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
INFO 2023-07-10T08:38:13.120351288Z [resource.labels.containerName: pulsar-broker] # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
INFO 2023-07-10T08:38:13.120477437Z [resource.labels.containerName: pulsar-broker] # Problematic frame:
INFO 2023-07-10T08:38:13.165446675Z [resource.labels.containerName: pulsar-broker] # V [libjvm.so+0xacf0f3] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x153
INFO 2023-07-10T08:38:13.166381326Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.167145592Z [resource.labels.containerName: pulsar-broker] # Core dump will be written. Default location: /core.%e.1.%t
INFO 2023-07-10T08:38:13.167357602Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.168488365Z [resource.labels.containerName: pulsar-broker] # An error report file with more information is saved as:
INFO 2023-07-10T08:38:13.169308881Z [resource.labels.containerName: pulsar-broker] # /pulsar/hs_err_pid1.log
INFO 2023-07-10T08:38:13.264952361Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.264992765Z [resource.labels.containerName: pulsar-broker] # Compiler replay data is saved as:
INFO 2023-07-10T08:38:13.265529939Z [resource.labels.containerName: pulsar-broker] # /pulsar/replay_pid1.log
INFO 2023-07-10T08:38:13.265759628Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.265899593Z [resource.labels.containerName: pulsar-broker] # If you would like to submit a bug report, please visit:
INFO 2023-07-10T08:38:13.266143760Z [resource.labels.containerName: pulsar-broker] # https://github.com/adoptium/adoptium-support/issues
INFO 2023-07-10T08:38:13.266641554Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.266960230Z [resource.labels.containerName: pulsar-broker] {}
INFO 2023-07-10T08:38:13.267208487Z [resource.labels.containerName: pulsar-broker] [error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007f9aac126941]
INFO 2023-07-10T08:38:13.267282513Z [resource.labels.containerName: pulsar-broker] {}
Important logs:
Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
V [libjvm.so+0xacf0f3] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x153
Seems like exactly the bug https://bugs.openjdk.org/browse/JDK-8285835?attachmentViewMode=list which is fixed in JDK 17.0.7
, while the apachepulsar/pulsar:3.0.0
runs on affected JVM version:
Seems like releasing a new Docker image with newer JDK should fix the problem (I've noticed that Ubuntu version was upgarded). Probably building the image right now should download the latest temurin-17-jdk
.
Would that be possible to release bugfix image with never JDK?
Thank you.
it happened again in pulsar-metadata tests.
Current thread (0x00007f927808ed70): JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=4012, stack(0x00007f8ffab00000,0x00007f8ffac00000)]
Current CompileTask:
C2: 365906 27888 4 org.apache.pulsar.metadata.bookkeeper.LongHierarchicalLedgerRangeIterator$InnerIterator::advance (207 bytes)
Stack: [0x00007f8ffab00000,0x00007f8ffac00000], sp=0x00007f8ffabfac50, free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xad5113] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3
V [libjvm.so+0xad5bb0] PhaseIdealLoop::build_loop_late(VectorSet&, Node_List&, Node_Stack&)+0x180
V [libjvm.so+0xad652e] PhaseIdealLoop::build_and_optimize()+0x7ae
V [libjvm.so+0x5d6c8f] PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x16f
V [libjvm.so+0x5d4b26] Compile::Optimize()+0xde6
V [libjvm.so+0x5d64e5] Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*)+0xe65
V [libjvm.so+0x5162c9] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0xe9
V [libjvm.so+0x5df632] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xef2
V [libjvm.so+0x5e0448] CompileBroker::compiler_thread_loop()+0x678
V [libjvm.so+0xe65584] JavaThread::thread_main_inner()+0x184
V [libjvm.so+0xe68d40] Thread::call_run()+0xc0
V [libjvm.so+0xc1e681] thread_native_entry(Thread*)+0xe1
https://gist.github.com/lhotari/e191a541ca4558b624736b27620f48cd contains the full error report file.
this issue still has label stale, see https://github.com/apache/pulsar/issues/21458
More recent crashes with similar stack trace: https://gist.github.com/lhotari/53b72683ad4f339dfbcfd8b9b97062b9
It looks like the issue will be fixed in 17.0.10 with https://bugs.openjdk.org/browse/JDK-8315861 which is the backport of https://bugs.openjdk.org/browse/JDK-8314024 . 17.0.10 is scheduled for Jan 16, 2024 based on https://wiki.openjdk.org/display/JDKUpdates/JDK+17u information.
The fix https://bugs.openjdk.org/browse/JDK-8314024 is already included in 21.0.1 which has already been released. One possibility would be to run Pulsar on Java 21.0.1 to get the fix earlier. An ea build of 17.0.10 is another option. That should be available soon. The third option is to wait until 17.0.10 is released in the last half of January.
There's another JVM crash where it happens with a different type of stack trace, in RocksDB:
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [librocksdbjni10071231183631223410.so+0x2adff0] rocksdb_iterator_helper(rocksdb::DB*, rocksdb::ReadOptions, rocksdb::ColumnFamilyHandle*)+0x60
j org.rocksdb.RocksDB.iterator(JJ)J+0
j org.rocksdb.RocksDB.newIterator(Lorg/rocksdb/ReadOptions;)Lorg/rocksdb/RocksIterator;+14
reported as #19250
Search before asking
Version
9e698b62
Minimal reproduce step
none, this happens with pulsar-metadata tests. Perhaps related to #19250
What did you expect to see?
No JVM crashes
What did you see instead?
The JVM crashed and caused the build to fail
The hs err file contains:
Anything else?
Full hs err file: https://gist.github.com/lhotari/4948f347a8f70f6a1051212139fcf551 another one: https://gist.github.com/lhotari/f3892d3577e285edc469141577215261
Failed build: https://github.com/lhotari/pulsar/actions/runs/3985274820/jobs/6832729920#step:11:1655
Are you willing to submit a PR?