eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 722 forks source link

Crash in 'jvmtiGetMethodDeclaringClass' #17466

Closed jbachorik closed 10 months ago

jbachorik commented 1 year ago

When using jvmtiGetMethodDeclaringClass to resolve a previously stored jmethodID value and obtain the information about the declaring class the JVM will crash reliably.

This is happening for Semeru 17.0.7, 11.0.19 but not 8.0.372.

The relevant part of the backtrace looks like

4XENATIVESTACK                (0x00007F8DD16B0520 [libc.so.6+0x42520])
4XENATIVESTACK               jvmtiGetMethodDeclaringClass+0xa2 (0x00007F8DD05ED622 [libj9jvmti29.so+0x24622])
4XENATIVESTACK               _ZN6Lookup18fillJavaMethodInfoEP10MethodInfoP10_jmethodIDb+0xaf (0x00007F8D788F58EF [libjavaProfiler8844014182548782683.so+0x238ef])
4XENATIVESTACK               _ZN9Recording10writeCpoolEP6Buffer.constprop.0+0xafa (0x00007F8D788F077A [libjavaProfiler8844014182548782683.so+0x1e77a])
4XENATIVESTACK               _ZN14FlightRecorder4dumpEPKci+0x18a (0x00007F8D788F318A [libjavaProfiler8844014182548782683.so+0x2118a])
4XENATIVESTACK               _ZN8Profiler4dumpEPKci+0x119 (0x00007F8D7890B0D9 [libjavaProfiler8844014182548782683.so+0x390d9])

I am attaching the report files. report.zip

gacholio commented 1 year ago

Have you taken steps to ensure the method iD stays valid (i.e. keeping a strong reference to the declaring class)? Method IDs are not roots themselves.

jbachorik commented 1 year ago

Yes, you are right, the JVMTI doc states for the jmethodID that

However, if the class is unloaded, they become invalid and must not be used.

Coming from Hotspot where they went long lengths to allow usage of jmethodID even for unloaded classes, I was taken by surprise.

Please note, that for the profiler usage it is really not possible to create strong references to classes - the stacktrace (which is comprised of the list of jmethodIds) is obtained in a signal handler and it is not possible to resolve all frames immediately and create references to classes. It wouldn't even be feasible doing this immediately when collecting the samples - the sampled thread is stopped and we want to resume it as soon as possible.

The only solution then would be listening on jvmtiEventCompiledMethodUnload and maintaining a set of unloaded jmethodIds which will have to be skipped. This, in turn will result in much larger number of samples where we can not resolve the stack because of missing class/method information for some of the frames.

jbachorik commented 1 year ago

Actually, listening on CompileMethodUnload event is not that helpful - even if I keep a set of methods unloaded till a particular moment there is no need to prevent a method getting unloaded in between doing check for jmethodID being in the set of unloaded methods and using that jmethodID for JVMTI calls.

jbachorik commented 1 year ago

Related thread in serviceability-dev mailing list: https://mail.openjdk.org/pipermail/serviceability-dev/2023-May/049150.html

gacholio commented 1 year ago

CompiledMethodUnload is not a reliable way to to anything - whether or not methods are JIT compiled is unpredictable, and clearly this will never be sent when running without the JIT.

There is no way to determine when classes are being unloaded (20 years ago the JVMTI event for this was removed).

I believe this problem is inherent to the async profiling API, and I can see no possible fix other than delaying the unload/invalidation of method IDs, which is neither practical nor reliable (wait long enough and you'll experience the problem).

jbachorik commented 1 year ago

Yes, CompiledMethodUnload was a blunder. I verified that the root cause is the class unloading - once I disable it, everything is working fine. Concurrent class unloading just makes the problem more prominent because even our feeble attempt on validating a jmethodID is not thread safe and we can validate jmethodID before calling a JVMTI function but it does not guarantee it won't crash a few moments later.

I fully understand that having a stable jmethodID is quite difficult - however, not having it renders the AsyncGetCallTrace unusable because there is no way to convert the jmethodIDs identifying the stack frames in a safe way, without a great chance of crashing JVM. We can not call JVMTI functions from the signal handler and doing it any time later is risking the crash.

gacholio commented 1 year ago

@tajila For comment.

DanHeidinga commented 1 year ago

If an agent attaches at startup and monitors every class load (JVMTI_EVENT_CLASS_LOAD), and asks every class for its jmethodIDs (GetClassMethods), the agent can then map from jmethodID -> class.

From there, it can use GetLoadedClasses to find all the classes still loaded and update the mapping to remove classes that aren't loaded anymore. Because GetLoadedClasses returns a jclass for each class, anything still findable in the map after processing will be valid until the jclasses are released.

I think that would work but would need to prototype it to confirm.

I don't think a class unload event would be sufficient as with concurrent class unloaded, classes could be unloaded while stil processing the jmethodIDs. Holding onto the jclass from GetLoadedClasses is required.

gacholio commented 1 year ago

@DanHeidinga The method IDs are coming from AsyncGetCallTrace, so there's no way to predict what classes are involved.

Holding a strong reference to every loaded class would just be a complicated way of disabling class unloading.

tajila commented 1 year ago

Coming from Hotspot where they went long lengths to allow usage of jmethodID even for unloaded classes, I was taken by surprise.

@jbachorik What is the expected behaviour when using a jmethodID of an unloaded class?

DanHeidinga commented 1 year ago

@DanHeidinga The method IDs are coming from AsyncGetCallTrace, so there's no way to predict what classes are involved.

Right. This would be a separate of bi-directional mapping from jmethodID <-> class so when ASGCT provides a set of jmethodIDs, before processing them, the agent would call GetLoadedClasses and use that to trim the mapping.

While holding the jclass array, it can then validate that the jmethodIDs are still in the mapping array. Anything not in the mapping array is invalid and must be ignored. Once the processing is done, the jclass array is released and class unloading can occur again.

It's basically building up a cache of all valid jmethodIDs as classes are loaded, and then using the GetLoadedClasses as a temporary way to block class unloading while processing after trimming the set of valid methods.

Holding a strong reference to every loaded class would just be a complicated way of disabling class unloading.

Agreed. But it's a temporary way to disable it that fits within the JVMTI spec

tajila commented 1 year ago

Could we just tag classloaders that have handed out JNIIDs via AGCST. Then when free them (freeClassLoader()) invalidate the JNIMethodIDs with UnknownClass.unknownMethod (class UnknownClass { void unknownMethod(){}}) ?

tajila commented 1 year ago

We would need to keep track of the ID pools on javaVM (or something like that) once the classloaders are gone. This would mean that we are leaking JNIIDs to some extent. I don't have a sense for how much this would cost.

gacholio commented 1 year ago

@DanHeidinga Remember that ASGCT is called from a signal handler, so no real calls can be made. Once the IDs are shuffled to another thread, they are immediately invalidateable (word?).

Could we just tag classloaders that have handed out JNIIDs via AGCST.

No, any mutation from the signal handler is inherently invalid.

The ASGCT API is inherently unreliable - we should just remove it. There are many sensible ways of instrumenting java code that don't rely on random timing. There's a reason this has never made it into an officlal API.

gacholio commented 1 year ago

Maybe we should have a meeting about this. I can think of some impractical (and possibly unimplementable) solutions, but they will never be reliable (by which I mean 100% safe from crash or invalid data). It seems to me that the async profiler API is broken, and only works by herculean efforts on HotSpot's part that probably adversely affect real world performance).

jbachorik commented 1 year ago

Yes, that's also my impression that with the concurrent class loading the ASGCT usage is very hit'n'miss and it is very likely to fail intermittently. Hotspot (and Zing by extension) does something under the hood to make this much less likely but there is still a slim chance of code checking and using jmethodID racing vs. the class unloading and leading to JVM crash.

We (mostly Johannes) are trying to come up with a better solution in https://bugs.openjdk.org/browse/JDK-8284289 - but the 'old' ASGCT seems to be doomed, considering this.

gacholio commented 1 year ago

Tobi's away for the next week - we will have a discussion about this when he returns.

jbachorik commented 11 months ago

Let me reopen this discussion.

For the authors of JVMTI profilers it would be enough if it was possible to reliably detect jmethodID value pointing to unloaded/deallocated structures. I did prototype with custom jmethodID->method FQN map in a JVMTI agent but the cost is terrible, both in terms of extra memory and the startup time overhead. Resolving a jmethodID into the FQN method name is rather heavy.

I tried groking the OpenJ9 sources for places where method structures get deallocated to see if something similar to https://bugs.openjdk.org/browse/JDK-8313816 is happening, but, apparently, I am not familiar enough with the codebase to arrive at anything meaningful. I wonder if anyone can direct me to places where the jmethodID maintenance is performed (if at all) - eg. when a method is redefined, the jmethodID should be redirected to the new version or when a method is deallocated, either because it's holder class is getting unloaded or the previous class version (before redefinition) is being purged - I would be grateful.

I think that even though JVMTI spec is pretty strict about usage of jmethodID values, the implementations are quite benevolent in the sense that if you want to use an invalid jmethodID to eg. resolve the defining class you should get NULL instead of crashing the JVM. Otherwise any stack-sampling JVMTI profiler would be basically a russian-roullette - it is not feasible to walk the stacktrace obtained by GetStackTrace or GetAllStackTraces JVMTI calls to create strong references for all classes referenced from all stack frames and therefore, when the time comes to actually write out the collected stacktraces in a human readable format there is a non-negligible chance that some of the required classes are already unloaded. And since the JVMTI stack sampling profilers are not crashing all the time for J9, I am assuming that there is some kind of protection there already and what we are really hitting is just some corner case, similar to what is happening in the Hotspot.

tajila commented 11 months ago

@jbachorik We are working on a prototype that will keep jniIDs stable at the cost of leaking some memory when class unloading occurs. We want to understand if this approach is feasible by seeing how it works in real world use cases. Could you try our prototype and measure the footprint? Which platforms do you run on?

Long term, I think AGCTv2 will be the solution, however, it may be a while before it is released. In the meantime I dont see any solution that doesn't involve some kind of trade-off.

gacholio commented 11 months ago

Note that the the referenced PR is not a prototype of anything - it does nothing to address invalid method IDs.

It's current purpose is to measure how bad the memory leak will be if we go ahead with this solution.

gacholio commented 11 months ago

If we move forward with this, I don't think we'll return NULL from the JVMTI queries - we'll consider the IDs to be invalid and return the appropriate error code.

jbachorik commented 11 months ago

@gacholio , error code is fine. As long as JVM does not hard-crash.

As for how bad the memory leak is - I would assume it would be approximately as bad as in hotspot where also the jmethodID wrapper instances are leaking. I expect the leak to be ~8 bytes per method on a 64bit system (the jmethodID type is a pointer to a pointer, right?). It might, perhaps, be possible to employ some reference counting mechanism where eg. it would be upon the API user to 'touch' all jmethodID values in the returned stacktrace and then 'release' them once the stacktrace is no longer needed. TBH, I am not sure if it is worth the hassle, though.

gacholio commented 11 months ago

With the current head stream code, you should be able to add -XX:+KeepJNIIDs and get the desired behaviour.

gacholio commented 11 months ago

Note that the size of an ID is 4 pointer slots (only 2 are used by method IDs, but they are allocated from a generic pool which also hosts field IDs).

Each pool structure is several allocations, the largest usually starting at 4K. The pool is per-ClassLoader, not per-Class.

gacholio commented 10 months ago

@jbachorik Please verify that a current VM fixes this issue. I'll add the optimization from https://github.com/eclipse-openj9/openj9/pull/18684#issuecomment-1879233182 once we have confirmation.

jbachorik commented 9 months ago

@gacholio I can confirm that with #18684 the jmethodID usage by our profiler is stable. Thanks for fixing the problem!

gacholio commented 9 months ago

Thanks for the verification - I'll proceed with the optimization shortly.