DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
583 stars 290 forks source link

Live-heap profiler (alpha) in dd-trace-java v1.33.0 crashes the JVM. (After updated from v1.32.0) #6948

Open stevenshim opened 6 months ago

stevenshim commented 6 months ago

Agent Version dd-java-agent:1.33.0

Describe what happened The JVM intermittently crashes and logs the following error.

Additional environment details Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.6.10.1 AWS EKS 1.25

Log

2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # A fatal error has been detected by the Java Runtime Environment:
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 #  SIGSEGV (0xb) at pc=0x00007f9fba87a340, pid=1, tid=50
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # JRE version: OpenJDK Runtime Environment Corretto-17.0.6.10.1 (17.0.6+10) (build 17.0.6+10-LTS)
2024-04-25T16:52:49+09:00 # Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.6.10.1 (17.0.6+10-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
2024-04-25T16:52:49+09:00 # Problematic frame:
2024-04-25T16:52:49+09:00 # C  [libjavaProfiler8298632684405129428.so+0x40340]  Profiler::recordExternalSample(unsigned long long, int, _jvmtiFrameInfo*, int, bool, int, Event*)+0xb0
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # Core dump will be written. Default location: /opt/goods-detail-api/core.1
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # JFR recording file will be written. Location: /opt/goods-detail-api/hs_err_pid1.jfr
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # An error report file with more information is saved as:
2024-04-25T16:52:49+09:00 # /opt/goods-detail-api/hs_err_pid1.log
2024-04-25T16:52:50+09:00 #
2024-04-25T16:52:50+09:00 # If you would like to submit a bug report, please visit:
2024-04-25T16:52:50+09:00 #   https://github.com/corretto/corretto-17/issues/
2024-04-25T16:52:50+09:00 # The crash happened outside the Java Virtual Machine in native code.
2024-04-25T16:52:50+09:00 # See problematic frame for where to report the bug.
2024-04-25T16:52:50+09:00 #
2024-04-25T16:52:50+09:00 
2024-04-25T16:52:50+09:00 [error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007fa02d41723b]
2024-04-25T16:52:50+09:00 

Hello, DataDog teams. I have been facing java process crashing after update dd-agent from 1.32.0 to 1.33.0.

I guess this prevention (https://github.com/DataDog/java-profiler/pull/84) might cause SIGSEGV. I already opened a issue on java-profiler project. (https://github.com/DataDog/java-profiler/issues/87)

I'm always grateful.

richardstartin commented 6 months ago

Hi @stevenshim - apologies for the inconvenience caused here and thanks for reporting the error. Could you give a little more information please?

  1. the stack trace from /opt/goods-detail-api/hs_err_pid1.log and the siginfo line just below it
  2. your Datadog profiling config flags
richardstartin commented 6 months ago

@stevenshim I suspect that you must be running the live heap profiler (DD_PROFILING_DDPROF_LIVEHEAP_ENABLED: true or -Ddd.profiling.ddprof.liveheap.enabled=true) - disabling this should be enough until we have a fix.

jbachorik commented 6 months ago

Hello @stevenshim - the issue was identified and a fix is in progress (https://github.com/DataDog/java-profiler/pull/88)

Thanks again for reporting this!

ghostdogpr commented 6 months ago

Is that issue happening only when Ddd.profiling.ddprof.liveheap.enabled=true? I've got 2 JVM crashes a few hours after upgrading to 1.33.0 that also say The crash happened outside the Java Virtual Machine in native code. (there's not much more detail though), but we didn't enable this flag.

ghostdogpr commented 6 months ago

I was able to replicate and save a crash dump. Maybe it's a different issue.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5d72c0a43b, pid=1, tid=165464
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.3+9 (21.0.3+9) (build 21.0.3+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (21.0.3+9-LTS, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# C  [ld-musl-x86_64.so.1+0x5943b]  strlen+0xd
#
# Core dump will be written. Default location: //core.1
#
# JFR recording file will be written. Location: //hs_err_pid1.jfr
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -XX:MaxRAMPercentage=70.0 -XX:InitialRAMPercentage=70.0 -XX:-UseCompressedOops -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/coredump -XX:OnError=/app/scripts-deploy/jvm-crash-handler.sh -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=9900 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.rmi.port=9900 -Djava.rmi.server.hostname=172.23.81.106 -javaagent:/app/external/dd-java-agent.jar -Ddd.profiling.enabled=true -Ddd.integration.zio.experimental.enabled=true -XX:+PrintCodeCache -XX:ProfiledCodeHeapSize=512m -XX:NonProfiledCodeHeapSize=1024m -XX:NonNMethodCodeHeapSize=32m -XX:ReservedCodeCacheSize=1568m -XX:+UseZGC -XX:+ZGenerational -Duser.timezone=Asia/Seoul -Dio.netty.allocator.numHeapArenas=0 -Ddd.integration.throwables.enabled=false --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.sql/java.sql=ALL-UNNAMED com.devsisters.ck.CkMain

Host: Intel(R) Xeon(R) Platinum 8488C, 16 cores, 110G, Alpine Linux v3.19
Time: Thu May  9 13:38:28 2024 KST elapsed time: 76054.516656 seconds (0d 21h 7m 34s)

---------------  T H R E A D  ---------------

Current thread (0x00007f571bf42040):  JavaThread "SIGTERM handler" daemon [_thread_in_native, id=165464, stack(0x00007f56bd317000,0x00007f56bd417a78) (1026K)]

Stack: [0x00007f56bd317000,0x00007f56bd417a78],  sp=0x00007f56bd416748,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [ld-musl-x86_64.so.1+0x5943b]  strlen+0xd
C  [libjavaProfiler11068430143537826157.so+0x29217]  Recording::writeSettings(Buffer*, Arguments&)+0xd7
C  [libjavaProfiler11068430143537826157.so+0x35a49]  Recording::switchChunk(int)+0xf9
C  [libjavaProfiler11068430143537826157.so+0x35b2f]  FlightRecorder::dump(char const*, int)+0x6f
C  [libjavaProfiler11068430143537826157.so+0x457d0]  Profiler::dump(char const*, int)+0x200
C  [libjavaProfiler11068430143537826157.so+0x245c7]  Java_com_datadoghq_profiler_JavaProfiler_dump0+0x47
J 323292  com.datadoghq.profiler.JavaProfiler.dump0(Ljava/lang/String;)V (0 bytes) @ 0x00007f5d31c718ff [0x00007f5d31c71820+0x00000000000000df]
J 323289 c1 com.datadog.profiling.ddprof.DatadogProfilerRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (44 bytes) @ 0x00007f5d0b178be4 [0x00007f5d0b1788a0+0x0000000000000344]
J 323287 c1 com.datadog.profiling.controller.ddprof.DatadogProfilerOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (16 bytes) @ 0x00007f5d0a72d8d4 [0x00007f5d0a72d6e0+0x00000000000001f4]
J 260563 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording$$Lambda+0x00007f58c81d9338.apply(Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007f5d1469d84c [0x00007f5d1469d5e0+0x000000000000026c]
J 4274 c2 java.util.stream.ReferencePipeline$3$1.accept(Ljava/lang/Object;)V java.base@21.0.3 (23 bytes) @ 0x00007f5d28ec3d18 [0x00007f5d28ec3ca0+0x0000000000000078]
J 47514 c2 java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Ljava/util/function/Consumer;)V java.base@21.0.3 (127 bytes) @ 0x00007f5d2b963e38 [0x00007f5d2b963d40+0x00000000000000f8]
J 32061 c2 java.util.stream.ReferencePipeline.collect(Ljava/util/stream/Collector;)Ljava/lang/Object; java.base@21.0.3 (124 bytes) @ 0x00007f5d2a9254ec [0x00007f5d2a925180+0x000000000000036c]
J 323261 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (12 bytes) @ 0x00007f5d0a4c4754 [0x00007f5d0a4c4000+0x0000000000000754]
J 323260 c1 com.datadog.profiling.controller.ProfilingSystem$SnapshotRecording.snapshot(Z)V (144 bytes) @ 0x00007f5d07f3b784 [0x00007f5d07f3b540+0x0000000000000244]
j  com.datadog.profiling.controller.ProfilingSystem.shutdown(Z)V+29
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Lcom/datadog/profiling/controller/ProfilingSystem;Lcom/datadog/profiling/uploader/ProfileUploader;Z)V+17
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Z)V+7
j  java.lang.invoke.LambdaForm$DMH+0x00007f58cc09e800.invokeStatic(Ljava/lang/Object;I)V+10 java.base@21.0.3
j  java.lang.invoke.LambdaForm$MH+0x00007f58cc09f800.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@21.0.3
J 45127 c2 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; java.base@21.0.3 (108 bytes) @ 0x00007f5d2b6454b8 [0x00007f5d2b644f80+0x0000000000000538]
j  datadog.trace.bootstrap.Agent.shutdownProfilingAgent(Z)V+64
j  datadog.trace.bootstrap.Agent.shutdown(Z)V+15
j  datadog.trace.bootstrap.instrumentation.shutdown.ShutdownHelper.shutdownAgent()V+11
j  java.lang.Shutdown.runHooks()V+0 java.base@21.0.3
j  java.lang.Shutdown.exit(I)V+12 java.base@21.0.3
j  java.lang.Terminator$1.handle(Ljdk/internal/misc/Signal;)V+8 java.base@21.0.3
j  jdk.internal.misc.Signal$1.run()V+8 java.base@21.0.3
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21.0.3
j  java.lang.Thread.run()V+19 java.base@21.0.3
v  ~StubRoutines::call_stub 0x00007f5d26ccbcbf
V  [libjvm.so+0x912765]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x2e5
V  [libjvm.so+0x9140ab]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1cb
V  [libjvm.so+0x9ee64a]  thread_entry(JavaThread*, JavaThread*)+0x8a
V  [libjvm.so+0x928ccf]  JavaThread::thread_main_inner() [clone .part.0]+0xaf
V  [libjvm.so+0xf82c98]  Thread::call_run()+0xa8
V  [libjvm.so+0xd0b23a]  thread_native_entry(Thread*)+0x12a
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 323292  com.datadoghq.profiler.JavaProfiler.dump0(Ljava/lang/String;)V (0 bytes) @ 0x00007f5d31c71887 [0x00007f5d31c71820+0x0000000000000067]
J 323289 c1 com.datadog.profiling.ddprof.DatadogProfilerRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (44 bytes) @ 0x00007f5d0b178be4 [0x00007f5d0b1788a0+0x0000000000000344]
J 323287 c1 com.datadog.profiling.controller.ddprof.DatadogProfilerOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (16 bytes) @ 0x00007f5d0a72d8d4 [0x00007f5d0a72d6e0+0x00000000000001f4]
J 260563 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording$$Lambda+0x00007f58c81d9338.apply(Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007f5d1469d84c [0x00007f5d1469d5e0+0x000000000000026c]
J 4274 c2 java.util.stream.ReferencePipeline$3$1.accept(Ljava/lang/Object;)V java.base@21.0.3 (23 bytes) @ 0x00007f5d28ec3d18 [0x00007f5d28ec3ca0+0x0000000000000078]
J 47514 c2 java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Ljava/util/function/Consumer;)V java.base@21.0.3 (127 bytes) @ 0x00007f5d2b963e38 [0x00007f5d2b963d40+0x00000000000000f8]
J 32061 c2 java.util.stream.ReferencePipeline.collect(Ljava/util/stream/Collector;)Ljava/lang/Object; java.base@21.0.3 (124 bytes) @ 0x00007f5d2a9254ec [0x00007f5d2a925180+0x000000000000036c]
J 323261 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (12 bytes) @ 0x00007f5d0a4c4754 [0x00007f5d0a4c4000+0x0000000000000754]
J 323260 c1 com.datadog.profiling.controller.ProfilingSystem$SnapshotRecording.snapshot(Z)V (144 bytes) @ 0x00007f5d07f3b784 [0x00007f5d07f3b540+0x0000000000000244]
j  com.datadog.profiling.controller.ProfilingSystem.shutdown(Z)V+29
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Lcom/datadog/profiling/controller/ProfilingSystem;Lcom/datadog/profiling/uploader/ProfileUploader;Z)V+17
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Z)V+7
j  java.lang.invoke.LambdaForm$DMH+0x00007f58cc09e800.invokeStatic(Ljava/lang/Object;I)V+10 java.base@21.0.3
j  java.lang.invoke.LambdaForm$MH+0x00007f58cc09f800.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@21.0.3
J 45127 c2 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; java.base@21.0.3 (108 bytes) @ 0x00007f5d2b6454b8 [0x00007f5d2b644f80+0x0000000000000538]
j  datadog.trace.bootstrap.Agent.shutdownProfilingAgent(Z)V+64
j  datadog.trace.bootstrap.Agent.shutdown(Z)V+15
j  datadog.trace.bootstrap.instrumentation.shutdown.ShutdownHelper.shutdownAgent()V+11
j  java.lang.Shutdown.runHooks()V+0 java.base@21.0.3
j  java.lang.Shutdown.exit(I)V+12 java.base@21.0.3
j  java.lang.Terminator$1.handle(Ljdk/internal/misc/Signal;)V+8 java.base@21.0.3
j  jdk.internal.misc.Signal$1.run()V+8 java.base@21.0.3
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21.0.3
j  java.lang.Thread.run()V+19 java.base@21.0.3
v  ~StubRoutines::call_stub 0x00007f5d26ccbcbf

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f58030185e4

Registers:
RAX=0x00007f58030185e4, RBX=0x000000000000218a, RCX=0x000000007265746c, RDX=0x0000000000000006
RSP=0x00007f56bd416748, RBP=0x00007f56bd416790, RSI=0x0000000000000006, RDI=0x00007f58030185e4
R8 =0x0000000000002182, R9 =0x0000000000002171, R10=0x0000000000000009, R11=0x0000000000000246
R12=0x00007f57fcd6a010, R13=0x00007f57fcd6a010, R14=0x0000000000002171, R15=0x00007f58030185e4
RIP=0x00007f5d72c0a43b, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e
richardstartin commented 6 months ago

Hi @ghostdogpr that looks like another issue, apologies for the inconvenience and thanks for the backtrace. I recommend downgrading until we've figured out the cause (we'll update you).