async-profiler / async-profiler

Sampling CPU and HEAP profiler for Java featuring AsyncGetCallTrace + perf_events
https://github.com/async-profiler/async-profiler/releases
Apache License 2.0
7.58k stars 853 forks source link

Lookup::fillJavaMethodInfo may cause crash on JDK 8/11/17 #974

Open yanglong1010 opened 3 months ago

yanglong1010 commented 3 months ago

Describe the bug

Hi, Andrei

We recently encountered a crash. After analysis, I think this problem may occur on OpenJDK 8, 11, and 17. It will not occur on OpenJDK 18 and above because JDK-8268364 avoids this problem at the JDK level.

V [libjvm.so+0x766108] oopDesc::is_a(Klass) const+0x48 V [libjvm.so+0x7635ec] jvmti_GetClassSignature+0x15c C [libasyncProfiler.so+0x35389] Recording::finishChunk()+0x20c9 C [libasyncProfiler.so+0x3599c] FlightRecorder::stop()+0x9c C [libasyncProfiler.so+0x39559] Profiler::stop(bool)+0x159 C [libasyncProfiler.so+0x3b4af] Profiler::runInternal(Arguments&, Writer&) [clone .constprop.772]+0x32f C [libasyncProfiler.so+0x3b6d1] Profiler::run(Arguments&)+0x181 C [libasyncProfiler.so+0x3d019] Agent_OnAttach+0x209 V [libjvm.so+0x7c12d9] JvmtiExport::load_agent_library(AttachOperation, outputStream)+0x329 V [libjvm.so+0x3328df] attach_listener_thread_entry(JavaThread, Thread)+0x15f V [libjvm.so+0xb11d32] JavaThread::thread_main_inner()+0x212 V [libjvm.so+0x942a62] java_start(Thread)+0xf2 C [libpthread.so.0+0x7ea5] start_thread+0xc5


- JDK 11

#

A fatal error has been detected by the Java Runtime Environment:

#

SIGSEGV (0xb) at pc=0x00007f73c96aa658, pid=31632, tid=31752

#

JRE version: OpenJDK Runtime Environment (Red_Hat-11.0.23.0.9-2.el7_9) (11.0.23+9) (build 11.0.23+9-LTS)

Java VM: OpenJDK 64-Bit Server VM (Red_Hat-11.0.23.0.9-2.el7_9) (11.0.23+9-LTS, mixed mode, sharing, tiered, compressed oops, concurrent mark sweep gc, linux-amd64)

Problematic frame:

V [libjvm.so+0x9be658] oopDesc::is_a(Klass) const+0x48 V [libjvm.so+0xa04153] jvmti_GetClassSignature+0x173 C [libasyncProfiler2540885687583512417.so+0x72b6] _jvmtiEnv::GetClassSignature(_jclass, char, char)+0x38 C [libasyncProfiler2540885687583512417.so+0x131f3] Lookup::fillJavaMethodInfo(MethodInfo, _jmethodID, bool, int)+0x7b C [libasyncProfiler2540885687583512417.so+0x135cb] Lookup::resolveMethod(ASGCT_CallFrame&, int)+0x11b C [libasyncProfiler2540885687583512417.so+0x16afb] Recording::writeStackTraces(Buffer, Lookup)+0x147 C [libasyncProfiler2540885687583512417.so+0x16381] Recording::writeCpool(Buffer*)+0x19d C [libasyncProfiler2540885687583512417.so+0x141d3] Recording::finishChunk(bool)+0xdf


### Expected vs. actual behavior

- Expected: not crash
- Actual: It will definitely crash in certain scenarios as described bellow

### Reproduction Steps

I can't provide a simple reproducer, but I can describe the reproduction steps clearly (in my environment, this problem can be reproduced very easily)

- async-profiler/build/bin/asprof -e cpu -f profile.jfr -d 30
- Use a custom ClassLoader C and load a class A 
- Invoke the Method M of Class A
- async-profiler records CPU sample,Method M is on the stack, and the jmethodId of M is recorded.
- ClassLoader C and class A become dead objects
- GC occurs, and ClassLoader C is unloaded (including Class A). On OpenJDK 8, 11, and 17, `Method::clear_jmethod_ids` is not called synchronously when unloading, but will be called at the next `SafepointSynchronize::begin`, resulting in a delay.
- async-profiler begin to dump JFR, before next `SafepointSynchronize::begin`
- async-profiler calls `jvmti->GetMethodDeclaringClass(method, &method_class)`, which returns normally. This is because the `Method*` in jmethodId is not set to NULL, `Method*` and `InstanceKlass*` have not been purged, and the memory address is valid. `Method*` points to `InstanceKlass*`, but the java_mirror oop (that is, method_class) in `InstanceKlass` has been GCed.
- Async-profiler calls `jvmti->GetClassSignature(method_class, &class_name, NULL)`, which may cause a crash. This is because the `java_mirror` memory corresponding to `method_class` has been released by GC, and the `Klass Word` in the object header is invalid.

### Additional Information/Context

The info from core dump:

8 0x00007f654c046ee3 in Klass::is_subtype_of (this=0x6f56df778, k=0x100002000) at /root/openjdk/jdk8u-dev-fork/hotspot/src/share/vm/oops/klass.hpp:438

9 0x00007f654c04849d in oopDesc::is_a (this=0xaad81430, k=0x100002000) at /root/openjdk/jdk8u-dev-fork/hotspot/src/share/vm/oops/oop.inline.hpp:178

10 0x00007f654c4d4627 in jvmti_GetClassSignature (env=0x7f6484706758, klass=0x7f64545fb910, signature_ptr=0x7f64bc402e10, generic_ptr=0x0) at ../generated/jvmtifiles/jvmtiEnter.cpp:3242

11 0x00007f64bd6f2336 in _jvmtiEnv::GetClassSignature (this=0x7f6484706758, klass=0x7f64545fb910, signature_ptr=0x7f64bc402e10, generic_ptr=0x0) at /usr/lib/jvm/java-11-openjdk-11.0.23.0.9-2.el7_9.x86_64//include/jvmti.h:2242

12 0x00007f64bd6fe29c in Lookup::fillJavaMethodInfo (this=0x7f64bc402fa0, mi=0x7f64545fb648, method=0x7f646c003120, first_time=true, i=1) at src/flightRecorder.cpp:198

(gdb) x/xg 0x7f646c003120 [fillJavaMethodInfo(....method=0x7f646c003120)] 0x7f646c003120: 0x00007f64bdb54858 [*(Method**)jmethodId is not NULL]

(gdb) x/xg 0x7f64545fb910 (The klass returned from GetMethodDeclaringClass) 0x7f64545fb910: 0x00000000aad81430 (java_mirror oop)

(gdb) x/32xw 0xaad81430 (the memory of java_mirror oop is freed) 0xaad81430: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xaad81440: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xaad81450: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xaad81460: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xaad81470: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xaad81480: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xaad81490: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef

(gdb) p/x 0xdeadbeefl << 3 (from invalid Klass word, get invalid Klass*) $2 = 0x6f56df778

(gdb) x/xg 0x6f56df778 (Klass* is not accessible) 0x6f56df778: Cannot access memory at address 0x6f56df778


For comparison, let's look at the normal situation.

0 Klass::is_subtype_of (this=0x100002000, k=0x100002000) at /root/openjdk/jdk8u-dev-fork/hotspot/src/share/vm/oops/klass.hpp:437

1 0x00007ffff617049d in oopDesc::is_a (this=0xaae013e0, k=0x100002000) at /root/openjdk/jdk8u-dev-fork/hotspot/src/share/vm/oops/oop.inline.hpp:178

2 0x00007ffff65fc627 in jvmti_GetClassSignature (env=0x7fff382095c8, klass=0x7ffefc001100, signature_ptr=0x7fff1259de10, generic_ptr=0x0) at ../generated/jvmtifiles/jvmtiEnter.cpp:3242

3 0x00007fff1388d2b6 in _jvmtiEnv::GetClassSignature (this=0x7fff382095c8, klass=0x7ffefc001100, signature_ptr=0x7fff1259de10, generic_ptr=0x0) at /usr/lib/jvm/java-11-openjdk-11.0.23.0.9-2.el7_9.x86_64//include/jvmti.h:2242

4 0x00007fff138991f3 in Lookup::fillJavaMethodInfo (this=0x7fff1259dfa0, mi=0x7ffefc5bef28, method=0x7fff39378e58, first_time=true, i=0) at src/flightRecorder.cpp:198

(gdb) x/xg 0x7fff39378e58 [fillJavaMethodInfo(....method=0x7fff39378e58)] 0x7fff39378e58: 0x00007fffa5ec9670 [*(Method**)jmethodId is not NULL]

(gdb) x/xg 0x7ffefc001100 (The klass returned from GetMethodDeclaringClass) 0x7ffefc001100: 0x00000000aae013e0 (java_mirror oop)

(gdb) x/32xw 0xaae013e0 (valid java_mirror oop) 0xaae013e0: 0xceb1a511 0x0000006a 0x20000400 0x00000000 0xaae013f0: 0x00000000 0xab575050 0x00000000 0xab074f10 0xaae01400: 0xaabae230 0x00000000 0x00000000 0x00000000 0xaae01410: 0x00000000 0x00000000 0x00000000 0x00000000 0xaae01420: 0x00000000 0x00000000 0x0013dcb8 0x00000001 0xaae01430: 0x00000000 0x00000000 0x00000000 0x00000016 0xaae01440: 0x00000006 0x00000000 0xaaba5b18 0xaaad5508 0xaae01450: 0xaab56ca8 0xab075868 0xab04e728 0xaaad0ab8 (gdb) p/x 0x20000400l << 3 (valid Klass*,accessible) $1 = 0x100002000



There is a check for invalid jmethodId (NULL Method*) in async-profiler,
https://github.com/async-profiler/async-profiler/blob/b8025a24499a690ccd4f5ce909d348d82768bb35/src/flightRecorder.cpp#L175
But it is obviously impossible to avoid the problem described above, because the address in jmethodId is not NULL.

This seems to be a bad design in JDK, which causes a crash when calling JVMTI.  
I feel that I can do something in async-profiler workaround it.

- One way is to force a `Safepoint` to be triggered before dump, it needs to rely on `WB_ForceSafepoint`, but this is a tool for testing scenarios.
- Before the next `Safepoint`, `Method*`, `InstanceKlass*` and other more structs are still accessible, so Another way is to avoid using java_mirror and directly read the Class Name from `InstanceKlass` through `offset` (similar to the approach used by https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/3cede2b41bfc9b1bd272ced4cf528e49b635bcfb/interpreter/hotspot/instance.go#L270). 

I don't know if you have a better idea.  I would be grateful if you could help me fix this.

### Async-profiler version

latest

### Environment details

- Linux (any version)
- JDK 8/11/17
- All CPU architectures
- Its not related to container environment
apangin commented 3 months ago

Thank you for the detailed report. If your analysis is correct, this is certainly a JVM bug. We'll try to create a reproducer and if successful, submit a fix in OpenJDK.

yanglong1010 commented 3 months ago

Here are some of the hs_err files: OpenJDK 11 hs_err_pid31632.log OpenJDK 8u412 hs_err_pid12810.log OpenJDK8 slowdebug hs_err_pid15470.log

yanglong1010 commented 3 months ago

I can provide a reproducer to save your time. Wait for me for a moment.

yanglong1010 commented 3 months ago

I have successfully reproduced it on both OpenJDK 8 and 11 today (see the files below, they are also included in the attachment).

On different machines, you may need to adjust the profiling duration (the only parameter of reproduce-11.sh and reproduce-8.sh) so that the dump occurs after the Class is unloaded and before the next Safepoint.

You need to try more times because there are many variables involved, which is essentially a race condition.
If you see the "Done" and no crash occurred, then need to retry.
If a crash occurs, there will be clear logs on the screen, refer to the examples below.

issue-974-reproducer.tar.gz

-rw-r--r-- 1 root root 161181 Aug 22 11:16 hs_err_pid9934.log
-rw-r--r-- 1 root root 159660 Aug 22 11:42 hs_err_pid18519.log
-rw-r--r-- 1 root root 154176 Aug 22 11:51 hs_err_pid21948.log
-rw-r--r-- 1 root root 157954 Aug 22 11:59 hs_err_pid24809.log
-rw-r--r-- 1 root root  72309 Aug 22 16:48 hs_err_pid31831.log
-rw-r--r-- 1 root root 152441 Aug 22 16:56 hs_err_pid3089.log

OpenJDK 11

bash reproduce-11.sh 20

[2024-08-22T16:48:01.219+0800] GC(2) Concurrent Reset
[2024-08-22T16:48:01.242+0800] GC(2) Concurrent Reset 23.495ms
[2024-08-22T16:48:03.466+0800] GC(5) Concurrent Reset
[2024-08-22T16:48:03.494+0800] GC(5) Concurrent Reset 27.993ms
[2024-08-22T16:48:10.579+0800] unloading class com.example.demo.TempClass 0x00000008403e2040
[2024-08-22T16:48:10.592+0800] GC(15) Concurrent Reset
[2024-08-22T16:48:10.595+0800] GC(15) Concurrent Reset 2.295ms
[2024-08-22T16:48:17.621+0800] unloading class com.example.demo.TempClass 0x0000000840424840
[2024-08-22T16:48:17.621+0800] unloading class com.example.demo.TempClass 0x0000000840425040
[2024-08-22T16:48:17.621+0800] unloading class com.example.demo.TempClass 0x0000000840425840
[2024-08-22T16:48:17.621+0800] unloading class com.example.demo.TempClass 0x0000000840426040
[2024-08-22T16:48:17.621+0800] unloading class com.example.demo.TempClass 0x0000000840426840
[2024-08-22T16:48:17.621+0800] unloading class com.example.demo.TempClass 0x0000000840427040
[2024-08-22T16:48:17.634+0800] GC(16) Concurrent Reset
[2024-08-22T16:48:17.637+0800] GC(16) Concurrent Reset 2.193ms
Done
#  SIGSEGV (0xb) at pc=0x00007f55e89c0658, pid=31831, tid=31924
# V  [libjvm.so+0x9be658]  oopDesc::is_a(Klass*) const+0x48

OpenJDK 8

bash reproduce-8.sh 25
1.107: [CMS-concurrent-reset: 0.034/0.034 secs] [Times: user=0.11 sys=0.04, real=0.03 secs]
8.557: [CMS-concurrent-reset: 0.003/0.003 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
13.759: [GC (CMS Final Remark) [YG occupancy: 106902 K (111808 K)]13.759: [Rescan (parallel) , 0.0008202 secs]13.760: [weak refs processing, 0.0000369 secs]13.760: [class unloading[Unloading class com.example.demo.TempClass 0x0000000100466028]
13.777: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
20.901: [GC (CMS Final Remark) [YG occupancy: 108890 K (111808 K)]20.901: [Rescan (parallel) , 0.0008640 secs]20.902: [weak refs processing, 0.0000437 secs]20.902: [class unloading[Unloading class com.example.demo.TempClass 0x0000000100485028]
[Unloading class com.example.demo.TempClass 0x0000000100484828]
20.917: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
27.989: [GC (CMS Final Remark) [YG occupancy: 108890 K (111808 K)]27.989: [Rescan (parallel) , 0.0008756 secs]27.990: [weak refs processing, 0.0000449 secs]27.990: [class unloading[Unloading class com.example.demo.TempClass 0x0000000100484028]
[Unloading class com.example.demo.TempClass 0x0000000100483828]
[Unloading class com.example.demo.TempClass 0x0000000100483028]
[Unloading class com.example.demo.TempClass 0x0000000100482828]
28.005: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
Done
#  SIGSEGV (0xb) at pc=0x00007f6b96faa108, pid=24809, tid=0x00007f6b32cc5700
# V  [libjvm.so+0x766108]  oopDesc::is_a(Klass*) const+0x48
zdyj3170101136 commented 3 months ago

is this would happen with --cstack vm?

yanglong1010 commented 3 months ago

is this would happen with --cstack vm?

Yes. --cstack vm and AsyncGetCallTrace all get jmethodIds for the methods on the stack. Later on dump time, the jmethodIds are translated to Class and Method names, the crash occurs at this phase.

zdyj3170101136 commented 3 months ago

is this would happen with --cstack vm?

Yes. --cstack vm and AsyncGetCallTrace all get jmethodIds for the methods on the stack. Later on dump time, the jmethodIds are translated to Class and Method names, the crash occurs at this phase.

java have so many problem of profiling that I don't dare to open it in prod.

apangin commented 2 months ago

@krk managed to create a test that reliably reproduces the issue in <1 sec. The test only uses pure JVM TI, no AsyncGetCallTrace or other profiling code. The bad news is that the test crashes all JVM versions, including those having the patch for JDK-8268364. However, it still makes sense to backport the patch, since it helps to avoid crashes with a modified test that calls JNI DeleteLocalRef as soon as jclass is no longer needed.

So, a backport of JDK-8268364 together with DeleteLocalRef change in async-profiler will likely prevent crashes. In the meantime, I'm investigating other possible workarounds. Forcing a safepoint does not solve the issue, but VMStructs approach looks promising.

yanglong1010 commented 2 months ago

Is my analysis correct ? Why forcing a safepoint does not solve the issue ?

apangin commented 2 months ago

Why forcing a safepoint does not solve the issue ?

What prevents from class unloading after a safepoint?

yanglong1010 commented 2 months ago

You are right, other class unloading can happen after the manually triggered safepoint, and we can not prevent this.

yanglong1010 commented 2 months ago

The test only uses pure JVM TI, no AsyncGetCallTrace or other profiling code.

does the reproducer created by @krk follow my reproduction steps, or only used contrived jclass (with invalid pointer inside) to call jvmti GetClassSignature ?

If used my steps, I don't know why this issue happens on JDK 18 and above.