apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.23k stars 3.58k forks source link

[Bug] JVM crash in unit tests: PhaseIdealLoop::build_loop_late_post_work #19307

Open lhotari opened 1 year ago

lhotari commented 1 year ago

Search before asking

Version

9e698b62

Minimal reproduce step

none, this happens with pulsar-metadata tests. Perhaps related to #19250

What did you expect to see?

No JVM crashes

What did you see instead?

The JVM crashed and caused the build to fail

Error:  Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M3:test (default-test) on project pulsar-metadata: There are test failures.
  Error:  
  Error:  Please refer to /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire-reports for the individual test results.
  Error:  Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
  Error:  ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
  Error:  Command was /bin/sh -c cd /home/runner/work/pulsar/pulsar/pulsar-metadata && /usr/lib/jvm/temurin-17-jdk-amd64/bin/java '-javaagent:/home/runner/.m2/repository/org/jacoco/org.jacoco.agent/0.8.8/org.jacoco.agent-0.8.8-runtime.jar=destfile=/home/runner/work/pulsar/pulsar/pulsar-metadata/target/jacoco_2023-01-23T10:19:37Z_4.exec,append=true,includes=org.apache.pulsar.*:org.apache.bookkeeper.mledger.*,excludes=*.proto.*:*.shade.*:*.shaded.*' -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:+ExitOnOutOfMemoryError -Xmx1G -XX:+UseZGC -Dpulsar.allocator.pooled=true -Dpulsar.allocator.leak_detection=Advanced -Dpulsar.allocator.exit_on_oom=false -Dpulsar.allocator.out_of_memory_policy=FallbackToHeap -Dio.netty.tryReflectionSetAccessible=true --add-opens java.base/jdk.internal.loader=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/sun.net=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -jar /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire/surefirebooter4173021429455748452.jar /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire 2023-01-23T10-20-05_807-jvmRun4 surefire348412133721025140tmp surefire_710936654358495701668tmp
  Error:  Error occurred in starting fork, check output in log
  Error:  Process Exit Code: 134
  Error:  Crashed tests:
  Error:  org.apache.pulsar.metadata.bookkeeper.LedgerManagerIteratorTest
  Error:  org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
  Error:  Command was /bin/sh -c cd /home/runner/work/pulsar/pulsar/pulsar-metadata && /usr/lib/jvm/temurin-17-jdk-amd64/bin/java '-javaagent:/home/runner/.m2/repository/org/jacoco/org.jacoco.agent/0.8.8/org.jacoco.agent-0.8.8-runtime.jar=destfile=/home/runner/work/pulsar/pulsar/pulsar-metadata/target/jacoco_2023-01-23T10:19:37Z_4.exec,append=true,includes=org.apache.pulsar.*:org.apache.bookkeeper.mledger.*,excludes=*.proto.*:*.shade.*:*.shaded.*' -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:+ExitOnOutOfMemoryError -Xmx1G -XX:+UseZGC -Dpulsar.allocator.pooled=true -Dpulsar.allocator.leak_detection=Advanced -Dpulsar.allocator.exit_on_oom=false -Dpulsar.allocator.out_of_memory_policy=FallbackToHeap -Dio.netty.tryReflectionSetAccessible=true --add-opens java.base/jdk.internal.loader=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/sun.net=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -jar /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire/surefirebooter4173021429455748452.jar /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire 2023-01-23T10-20-05_807-jvmRun4 surefire348412133721025140tmp surefire_710936654358495701668tmp
  Error:  Error occurred in starting fork, check output in log
  Error:  Process Exit Code: 134
  Error:  Crashed tests:
  Error:  org.apache.pulsar.metadata.bookkeeper.LedgerManagerIteratorTest
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:383)
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:298)
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
  Error:    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1161)
  Error:    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1002)
  Error:    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:848)
  Error:    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
  Error:    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:370)
  Error:    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:351)
  Error:    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
  Error:    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:171)
  Error:    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:163)
  Error:    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
  Error:    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
  Error:    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
  Error:    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
  Error:    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:298)
  Error:    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
  Error:    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
  Error:    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:960)
  Error:    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:293)
  Error:    at org.apache.maven.cli.MavenCli.main(MavenCli.java:196)
  Error:    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  Error:    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  Error:    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  Error:    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  Error:    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:282)
  Error:    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:225)
  Error:    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:406)
  Error:    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:347)
  Error:  Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
  Error:  Command was /bin/sh -c cd /home/runner/work/pulsar/pulsar/pulsar-metadata && /usr/lib/jvm/temurin-17-jdk-amd64/bin/java '-javaagent:/home/runner/.m2/repository/org/jacoco/org.jacoco.agent/0.8.8/org.jacoco.agent-0.8.8-runtime.jar=destfile=/home/runner/work/pulsar/pulsar/pulsar-metadata/target/jacoco_2023-01-23T10:19:37Z_4.exec,append=true,includes=org.apache.pulsar.*:org.apache.bookkeeper.mledger.*,excludes=*.proto.*:*.shade.*:*.shaded.*' -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:+ExitOnOutOfMemoryError -Xmx1G -XX:+UseZGC -Dpulsar.allocator.pooled=true -Dpulsar.allocator.leak_detection=Advanced -Dpulsar.allocator.exit_on_oom=false -Dpulsar.allocator.out_of_memory_policy=FallbackToHeap -Dio.netty.tryReflectionSetAccessible=true --add-opens java.base/jdk.internal.loader=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/sun.net=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -jar /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire/surefirebooter4173021429455748452.jar /home/runner/work/pulsar/pulsar/pulsar-metadata/target/surefire 2023-01-23T10-20-05_807-jvmRun4 surefire348412133721025140tmp surefire_71093665435849570[1668](https://github.com/lhotari/pulsar/actions/runs/3985274820/jobs/6832729920#step:11:1669)tmp
  Error:  Error occurred in starting fork, check output in log
  Error:  Process Exit Code: 134
  Error:  Crashed tests:
  Error:  org.apache.pulsar.metadata.bookkeeper.LedgerManagerIteratorTest
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:670)
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$600(ForkStarter.java:116)
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:372)
  Error:    at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:348)
  Error:    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  Error:    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  Error:    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  Error:    at java.base/java.lang.Thread.run(Thread.java:833)
  Error:  -> [Help 1]

The hs err file contains:

Current CompileTask:
C2:  66042 10534       4       org.apache.pulsar.metadata.bookkeeper.LongHierarchicalLedgerRangeIterator$InnerIterator::advance (204 bytes)

Stack: [0x00007f65b5279000,0x00007f65b537a000],  sp=0x00007f65b5374dd0,  free space=1007k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xacda43]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x153
V  [libjvm.so+0xace400]  PhaseIdealLoop::build_loop_late(VectorSet&, Node_List&, Node_Stack&)+0x180
V  [libjvm.so+0xaced7e]  PhaseIdealLoop::build_and_optimize()+0x7ae
V  [libjvm.so+0x5d237f]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x16f
V  [libjvm.so+0x5d0216]  Compile::Optimize()+0xdc6
V  [libjvm.so+0x5d1bd5]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*)+0xe65
V  [libjvm.so+0x510cb9]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0xe9
V  [libjvm.so+0x5dab62]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xef2
V  [libjvm.so+0x5db808]  CompileBroker::compiler_thread_loop()+0x508
V  [libjvm.so+0xe59f04]  JavaThread::thread_main_inner()+0x184
V  [libjvm.so+0xe5d5b0]  Thread::call_run()+0xc0
V  [libjvm.so+0xc17381]  thread_native_entry(Thread*)+0xe1

Anything else?

Full hs err file: https://gist.github.com/lhotari/4948f347a8f70f6a1051212139fcf551 another one: https://gist.github.com/lhotari/f3892d3577e285edc469141577215261

Failed build: https://github.com/lhotari/pulsar/actions/runs/3985274820/jobs/6832729920#step:11:1655

Are you willing to submit a PR?

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

7hong commented 1 year ago

I found this problem in the test environment. Pulsar version: 2.11.1 JDK version: 17.0.6 Perhaps it is related to this problem( https://bugs.openjdk.org/browse/jdk-8285835), so upgrade JDK to 17.0.7. There is no crash for the time being.

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

tisonkun commented 1 year ago

Closed according to https://github.com/apache/pulsar/issues/19307#issuecomment-1541958093. If it comes back, feel free to open a new issue.

malaskowski commented 1 year ago

Hi, I was able to reproduce this problem on the latest Pulsar 3.0.0 image and would like to re-open this issue.

Search before asking

Version

Minimal reproduce step

  1. Run Apache Pulsar cluster with Helm Chart default values and the only change: --set defaultPulsarImageTag=3.0.0.
  2. Create partitioned topic (3 partitions) and start producing messages (message size is ~30kB).

What did you expect to see?

Messages produced with success. Brokers do not restart.

What did you see instead?

After around 30 000 messages, Brokers keep restarting with the following error:

INFO 2023-07-10T08:38:13.117316491Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.117589437Z [resource.labels.containerName: pulsar-broker] # A fatal error has been detected by the Java Runtime Environment:
INFO 2023-07-10T08:38:13.118142215Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.119489360Z [resource.labels.containerName: pulsar-broker] # SIGSEGV (0xb) at pc=0x00007f9aab88e0f3, pid=1, tid=202
INFO 2023-07-10T08:38:13.119805559Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.120234080Z [resource.labels.containerName: pulsar-broker] # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
INFO 2023-07-10T08:38:13.120351288Z [resource.labels.containerName: pulsar-broker] # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
INFO 2023-07-10T08:38:13.120477437Z [resource.labels.containerName: pulsar-broker] # Problematic frame:
INFO 2023-07-10T08:38:13.165446675Z [resource.labels.containerName: pulsar-broker] # V [libjvm.so+0xacf0f3] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x153
INFO 2023-07-10T08:38:13.166381326Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.167145592Z [resource.labels.containerName: pulsar-broker] # Core dump will be written. Default location: /core.%e.1.%t
INFO 2023-07-10T08:38:13.167357602Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.168488365Z [resource.labels.containerName: pulsar-broker] # An error report file with more information is saved as:
INFO 2023-07-10T08:38:13.169308881Z [resource.labels.containerName: pulsar-broker] # /pulsar/hs_err_pid1.log
INFO 2023-07-10T08:38:13.264952361Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.264992765Z [resource.labels.containerName: pulsar-broker] # Compiler replay data is saved as:
INFO 2023-07-10T08:38:13.265529939Z [resource.labels.containerName: pulsar-broker] # /pulsar/replay_pid1.log
INFO 2023-07-10T08:38:13.265759628Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.265899593Z [resource.labels.containerName: pulsar-broker] # If you would like to submit a bug report, please visit:
INFO 2023-07-10T08:38:13.266143760Z [resource.labels.containerName: pulsar-broker] # https://github.com/adoptium/adoptium-support/issues
INFO 2023-07-10T08:38:13.266641554Z [resource.labels.containerName: pulsar-broker] #
INFO 2023-07-10T08:38:13.266960230Z [resource.labels.containerName: pulsar-broker] {}
INFO 2023-07-10T08:38:13.267208487Z [resource.labels.containerName: pulsar-broker] [error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007f9aac126941]
INFO 2023-07-10T08:38:13.267282513Z [resource.labels.containerName: pulsar-broker] {}

Important logs:

Anything else?

Seems like exactly the bug https://bugs.openjdk.org/browse/JDK-8285835?attachmentViewMode=list which is fixed in JDK 17.0.7, while the apachepulsar/pulsar:3.0.0 runs on affected JVM version:

Seems like releasing a new Docker image with newer JDK should fix the problem (I've noticed that Ubuntu version was upgarded). Probably building the image right now should download the latest temurin-17-jdk.

Would that be possible to release bugfix image with never JDK?

Thank you.

lhotari commented 1 year ago

it happened again in pulsar-metadata tests.

Current thread (0x00007f927808ed70):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=4012, stack(0x00007f8ffab00000,0x00007f8ffac00000)]

Current CompileTask:
C2: 365906 27888       4       org.apache.pulsar.metadata.bookkeeper.LongHierarchicalLedgerRangeIterator$InnerIterator::advance (207 bytes)

Stack: [0x00007f8ffab00000,0x00007f8ffac00000],  sp=0x00007f8ffabfac50,  free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xad5113]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3
V  [libjvm.so+0xad5bb0]  PhaseIdealLoop::build_loop_late(VectorSet&, Node_List&, Node_Stack&)+0x180
V  [libjvm.so+0xad652e]  PhaseIdealLoop::build_and_optimize()+0x7ae
V  [libjvm.so+0x5d6c8f]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x16f
V  [libjvm.so+0x5d4b26]  Compile::Optimize()+0xde6
V  [libjvm.so+0x5d64e5]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*)+0xe65
V  [libjvm.so+0x5162c9]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0xe9
V  [libjvm.so+0x5df632]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xef2
V  [libjvm.so+0x5e0448]  CompileBroker::compiler_thread_loop()+0x678
V  [libjvm.so+0xe65584]  JavaThread::thread_main_inner()+0x184
V  [libjvm.so+0xe68d40]  Thread::call_run()+0xc0
V  [libjvm.so+0xc1e681]  thread_native_entry(Thread*)+0xe1

https://gist.github.com/lhotari/e191a541ca4558b624736b27620f48cd contains the full error report file.

hpvd commented 1 year ago

this issue still has label stale, see https://github.com/apache/pulsar/issues/21458

lhotari commented 1 year ago

More recent crashes with similar stack trace: https://gist.github.com/lhotari/53b72683ad4f339dfbcfd8b9b97062b9

lhotari commented 1 year ago

It looks like the issue will be fixed in 17.0.10 with https://bugs.openjdk.org/browse/JDK-8315861 which is the backport of https://bugs.openjdk.org/browse/JDK-8314024 . 17.0.10 is scheduled for Jan 16, 2024 based on https://wiki.openjdk.org/display/JDKUpdates/JDK+17u information.

lhotari commented 1 year ago

The fix https://bugs.openjdk.org/browse/JDK-8314024 is already included in 21.0.1 which has already been released. One possibility would be to run Pulsar on Java 21.0.1 to get the fix earlier. An ea build of 17.0.10 is another option. That should be available soon. The third option is to wait until 17.0.10 is released in the last half of January.

lhotari commented 1 year ago

There's another JVM crash where it happens with a different type of stack trace, in RocksDB:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [librocksdbjni10071231183631223410.so+0x2adff0]  rocksdb_iterator_helper(rocksdb::DB*, rocksdb::ReadOptions, rocksdb::ColumnFamilyHandle*)+0x60
j  org.rocksdb.RocksDB.iterator(JJ)J+0
j  org.rocksdb.RocksDB.newIterator(Lorg/rocksdb/ReadOptions;)Lorg/rocksdb/RocksIterator;+14

reported as #19250