eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Coredump on libj9jit29.so when using async profiler with datadog module #15442

Open monwolf opened 2 years ago

monwolf commented 2 years ago

Java -version output

Output from java -version.

$ java -version
openjdk version "11.0.15" 2022-04-19
IBM Semeru Runtime Open Edition 11.0.15.0 (build 11.0.15+10)
Eclipse OpenJ9 VM 11.0.15.0 (build openj9-0.32.0, JRE 11 Linux amd64-64-Bit Compressed References 20220422_425 (JIT enabled, AOT enabled)
OpenJ9   - 9a84ec34e
OMR      - ab24b6666
JCL      - b7b5b42ea6 based on jdk-11.0.15+10)

Summary of problem

After enabling the async profiler through the Datadog module, we started seeing crashes in the service (see https://github.com/DataDog/dd-trace-java/issues/3616). I don't know whether this is related to the Datadog agent implementation or to the JIT.

Stderr:

Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000080
Handler1=00007F7CB804EA20 Handler2=00007F7CB3D9C0E0 InaccessibleAddress=0000000000000000
RDI=00000000027B2910 RSI=0000000000000004 RAX=0080000000090300 RBX=00000000027B2910
RCX=0000000000400000 RDX=00007F7C9B72B130 R8=0000000000000000 R9=00000000027B9240
R10=00000000027B9208 R11=000000001C0C0100 R12=0000000000000000 R13=000000001C0C0100
R14=0000000000000000 R15=0000000000000005
RIP=00007F7CB29FD2E2 GS=0000 FS=0000 RSP=00007F7C9A53B120
EFlags=0000000000010202 CS=0033 RBP=00000000027B9278 ERR=0000000000000000
TRAPNO=000000000000000D OLDMASK=0000000000000000 CR2=0000000000000000
xmm0 0000003000000020 (f: 32.000000, d: 1.018558e-312)
xmm1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm2 00007f7c9a53b5b0 (f: 2589177344.000000, d: 6.925473e-310)
xmm3 823f97e2cb4f6ac6 (f: 3410979584.000000, d: -7.548130e-298)
xmm4 38d93649e356e643 (f: 3814123008.000000, d: 7.586980e-35)
xmm5 3a000000c2000000 (f: 3254779904.000000, d: 2.524357e-29)
xmm6 269004cd5c0e6ac0 (f: 1544448768.000000, d: 6.058019e-123)
xmm7 00001969ed58d969 (f: 3982022912.000000, d: 1.380555e-310)
xmm8 0b0a090803020100 (f: 50462976.000000, d: 1.733947e-255)
xmm9 ffffffffffffffff (f: 4294967296.000000, d: -nan)
xmm10 7e0ebb6f592ae470 (f: 1495983232.000000, d: 1.607900e+299)
xmm11 09836ec563cd36ff (f: 1674393344.000000, d: 7.714130e-263)
xmm12 0000000000000001 (f: 1.000000, d: 4.940656e-324)
xmm13 08090a0b0c0d0e0f (f: 202182160.000000, d: 5.924543e-270)
xmm14 0000000000c167ee (f: 12675054.000000, d: 6.262309e-317)
xmm15 45c296f89cbe2ce8 (f: 2629709056.000000, d: 1.150649e+28)
Module=/opt/java/openjdk/lib/default/libj9jit29.so
Module_base_address=00007F7CB20AF000
Target=2_90_20220422_425 (Linux 3.10.0-1160.21.1.el7.x86_64)
CPU=amd64 (4 logical CPUs) (0x2e3aca000 RAM)
----------- Stack Backtrace -----------
jitWalkStackFrames+0x1482 (0x00007F7CB29FD2E2 [libj9jit29.so+0x94e2e2])
walkStackFrames+0xb3 (0x00007F7CB808E053 [libj9vm29.so+0x7f053])
_ZN32VM_BytecodeInterpreterCompressed3runEP10J9VMThread+0x5594 (0x00007F7CB80A62C4 [libj9vm29.so+0x972c4])
bytecodeLoopCompressed+0x95 (0x00007F7CB80A0D25 [libj9vm29.so+0x91d25])
 (0x00007F7CB814B942 [libj9vm29.so+0x13c942])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2022/06/28 07:10:00 - please wait.
JVMDUMP039I Processing dump event "abort", detail "" at 2022/06/28 07:10:00 - please wait.

[dd.trace 2022-06-28 08:39:13:431 +0000] [OkHttp http://10.145.0.84:8126/...] WARN com.datadog.profiling.uploader.ProfileUploader - Failed to upload profile, received empty reply from http://10.145.0.84:8126/profiling/v1/input after uploading profile (Will not log errors for 5 minutes)
[dd.trace 2022-06-28 08:40:09:155 +0000] [OkHttp http://10.145.0.84:8126/...] INFO com.datadog.profiling.uploader.ProfileUploader - Upload done

Because we run this in an ephemeral environment, we don't have access to the core dumps or crash files.
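
Without the core file, the stderr header is the only evidence available, so it can help to decode it mechanically. The following is a minimal sketch, not an OpenJ9 tool: `parse_fields` and `is_canonical_x86_64` are hypothetical helper names, and the input is a subset of the lines printed above. It checks that the signal is SIGSEGV, that the trap number is 13 (general protection on x86), and that the faulting RIP lands at the module offset shown in the backtrace. The canonical-address check assumes 48-bit virtual addressing. A non-canonical register value is only a hint about the cause, not a diagnosis.

```python
import re
import signal

# Subset of the stderr output from this report (copied verbatim).
CRASH_HEADER = """\
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000080
RDI=00000000027B2910 RSI=0000000000000004 RAX=0080000000090300 RBX=00000000027B2910
RIP=00007F7CB29FD2E2 GS=0000 FS=0000 RSP=00007F7C9A53B120
TRAPNO=000000000000000D OLDMASK=0000000000000000 CR2=0000000000000000
Module_base_address=00007F7CB20AF000
"""


def parse_fields(text):
    """Return a dict mapping NAME -> int for every NAME=HEX pair in text."""
    pairs = re.findall(r"(\w+)=([0-9A-Fa-f]+)\b", text)
    return {name: int(value, 16) for name, value in pairs}


def is_canonical_x86_64(addr):
    """True if bits 63..47 are all 0 or all 1 (assumes 48-bit virtual addresses)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


fields = parse_fields(CRASH_HEADER)

# Signal 11 is SIGSEGV on Linux; trap 13 (0xD) is the x86 general-protection fault.
print("signal:", signal.Signals(fields["Signal_Number"]).name)
print("trapno:", fields["TRAPNO"])

# Offset of the faulting instruction inside libj9jit29.so.
print("module offset:", hex(fields["RIP"] - fields["Module_base_address"]))

# List general-purpose registers whose value is not a canonical address.
for reg in ("RDI", "RSI", "RAX", "RBX"):
    if not is_canonical_x86_64(fields[reg]):
        print(reg, "is non-canonical:", hex(fields[reg]))
```

The computed module offset can be compared with the `libj9jit29.so+0x94e2e2` entry in the backtrace, and the same offset can be handed to a symbolizer if a matching debug build of the library is available.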

0xdaryl commented 2 years ago

@mpirvu : could you have someone triage this problem please?

mpirvu commented 2 years ago

@gacholio does OpenJ9 support asynchronous sampling at the moment? I know you did some work in this area, but I don't know the outcome. Thanks

gacholio commented 2 years ago

Yes, this has been implemented. This crash is not in the async sampler, which implies that either the problem is not in AsyncGetCallTrace at all, or the async call is corrupting the stack and causing a later crash.

gacholio commented 2 years ago

Another consumer of ASGCT had to modify their signal handler to avoid a large on-stack buffer for the returned frames.

Some details may be found in #13838

mpirvu commented 2 years ago

@monwolf Could you please help us reproduce this issue? What would be the simplest setup that shows this bug? Personally, I have no experience with the Datadog profiler. Thanks

monwolf commented 2 years ago

I haven't had any luck reproducing it with the Spring Boot PetClinic app. I'll try again tomorrow.

yaakov-berkovitch commented 1 year ago

@monwolf Any news regarding this issue? We are facing the same failure. We upgraded to Eclipse OpenJ9 VM 11.0.17.0 (build openj9-0.35.0, JRE 11 Linux amd64-64-Bit Compressed References 20221031_559 (JIT enabled, AOT enabled), but it didn't help. I also confirmed that everything runs fine once the Datadog agent is disabled.