Closed jbachorik closed 10 months ago
Have you taken steps to ensure the method iD stays valid (i.e. keeping a strong reference to the declaring class)? Method IDs are not roots themselves.
Yes, you are right, the JVMTI doc states for the jmethodID
that
However, if the class is unloaded, they become invalid and must not be used.
Coming from Hotspot where they went long lengths to allow usage of jmethodID
even for unloaded classes, I was taken by surprise.
Please note, that for the profiler usage it is really not possible to create strong references to classes - the stacktrace (which is comprised of the list of jmethodIds) is obtained in a signal handler and it is not possible to resolve all frames immediately and create references to classes. It wouldn't even be feasible doing this immediately when collecting the samples - the sampled thread is stopped and we want to resume it as soon as possible.
The only solution then would be listening on jvmtiEventCompiledMethodUnload
and maintaining a set of unloaded jmethodIds which will have to be skipped. This, in turn will result in much larger number of samples where we can not resolve the stack because of missing class/method information for some of the frames.
Actually, listening on CompileMethodUnload
event is not that helpful - even if I keep a set of methods unloaded till a particular moment there is no need to prevent a method getting unloaded in between doing check for jmethodID
being in the set of unloaded methods and using that jmethodID
for JVMTI calls.
Related thread in serviceability-dev mailing list: https://mail.openjdk.org/pipermail/serviceability-dev/2023-May/049150.html
CompiledMethodUnload
is not a reliable way to to anything - whether or not methods are JIT compiled is unpredictable, and clearly this will never be sent when running without the JIT.
There is no way to determine when classes are being unloaded (20 years ago the JVMTI event for this was removed).
I believe this problem is inherent to the async profiling API, and I can see no possible fix other than delaying the unload/invalidation of method IDs, which is neither practical nor reliable (wait long enough and you'll experience the problem).
Yes, CompiledMethodUnload
was a blunder. I verified that the root cause is the class unloading - once I disable it, everything is working fine. Concurrent class unloading just makes the problem more prominent because even our feeble attempt on validating a jmethodID is not thread safe and we can validate jmethodID before calling a JVMTI function but it does not guarantee it won't crash a few moments later.
I fully understand that having a stable jmethodID is quite difficult - however, not having it renders the AsyncGetCallTrace unusable because there is no way to convert the jmethodIDs identifying the stack frames in a safe way, without a great chance of crashing JVM. We can not call JVMTI functions from the signal handler and doing it any time later is risking the crash.
@tajila For comment.
If an agent attaches at startup and monitors every class load (JVMTI_EVENT_CLASS_LOAD), and asks every class for its jmethodIDs (GetClassMethods), the agent can then map from jmethodID -> class.
From there, it can use GetLoadedClasses
to find all the classes still loaded and update the mapping to remove classes that aren't loaded anymore. Because GetLoadedClasses
returns a jclass
for each class, anything still findable in the map after processing will be valid until the jclasses are released.
I think that would work but would need to prototype it to confirm.
I don't think a class unload event would be sufficient as with concurrent class unloaded, classes could be unloaded while stil processing the jmethodIDs. Holding onto the jclass
from GetLoadedClasses
is required.
@DanHeidinga The method IDs are coming from AsyncGetCallTrace
, so there's no way to predict what classes are involved.
Holding a strong reference to every loaded class would just be a complicated way of disabling class unloading.
Coming from Hotspot where they went long lengths to allow usage of jmethodID even for unloaded classes, I was taken by surprise.
@jbachorik What is the expected behaviour when using a jmethodID of an unloaded class?
@DanHeidinga The method IDs are coming from
AsyncGetCallTrace
, so there's no way to predict what classes are involved.
Right. This would be a separate of bi-directional mapping from jmethodID <-> class
so when ASGCT provides a set of jmethodID
s, before processing them, the agent would call GetLoadedClasses
and use that to trim the mapping.
While holding the jclass array, it can then validate that the jmethodID
s are still in the mapping array. Anything not in the mapping array is invalid and must be ignored. Once the processing is done, the jclass array is released and class unloading can occur again.
It's basically building up a cache of all valid jmethodIDs as classes are loaded, and then using the GetLoadedClasses
as a temporary way to block class unloading while processing after trimming the set of valid methods.
Holding a strong reference to every loaded class would just be a complicated way of disabling class unloading.
Agreed. But it's a temporary way to disable it that fits within the JVMTI spec
Could we just tag classloaders that have handed out JNIIDs via AGCST. Then when free them (freeClassLoader()
) invalidate the JNIMethodIDs with UnknownClass.unknownMethod (class UnknownClass { void unknownMethod(){}}
) ?
We would need to keep track of the ID pools on javaVM (or something like that) once the classloaders are gone. This would mean that we are leaking JNIIDs to some extent. I don't have a sense for how much this would cost.
@DanHeidinga Remember that ASGCT is called from a signal handler, so no real calls can be made. Once the IDs are shuffled to another thread, they are immediately invalidateable (word?).
Could we just tag classloaders that have handed out JNIIDs via AGCST.
No, any mutation from the signal handler is inherently invalid.
The ASGCT API is inherently unreliable - we should just remove it. There are many sensible ways of instrumenting java code that don't rely on random timing. There's a reason this has never made it into an officlal API.
Maybe we should have a meeting about this. I can think of some impractical (and possibly unimplementable) solutions, but they will never be reliable (by which I mean 100% safe from crash or invalid data). It seems to me that the async profiler API is broken, and only works by herculean efforts on HotSpot's part that probably adversely affect real world performance).
Yes, that's also my impression that with the concurrent class loading the ASGCT usage is very hit'n'miss and it is very likely to fail intermittently. Hotspot (and Zing by extension) does something under the hood to make this much less likely but there is still a slim chance of code checking and using jmethodID racing vs. the class unloading and leading to JVM crash.
We (mostly Johannes) are trying to come up with a better solution in https://bugs.openjdk.org/browse/JDK-8284289 - but the 'old' ASGCT seems to be doomed, considering this.
Tobi's away for the next week - we will have a discussion about this when he returns.
Let me reopen this discussion.
For the authors of JVMTI profilers it would be enough if it was possible to reliably detect jmethodID
value pointing to unloaded/deallocated structures. I did prototype with custom jmethodID->method FQN map in a JVMTI agent but the cost is terrible, both in terms of extra memory and the startup time overhead. Resolving a jmethodID
into the FQN method name is rather heavy.
I tried groking the OpenJ9 sources for places where method structures get deallocated to see if something similar to https://bugs.openjdk.org/browse/JDK-8313816 is happening, but, apparently, I am not familiar enough with the codebase to arrive at anything meaningful. I wonder if anyone can direct me to places where the jmethodID
maintenance is performed (if at all) - eg. when a method is redefined, the jmethodID
should be redirected to the new version or when a method is deallocated, either because it's holder class is getting unloaded or the previous class version (before redefinition) is being purged - I would be grateful.
I think that even though JVMTI spec is pretty strict about usage of jmethodID
values, the implementations are quite benevolent in the sense that if you want to use an invalid jmethodID
to eg. resolve the defining class you should get NULL
instead of crashing the JVM.
Otherwise any stack-sampling JVMTI profiler would be basically a russian-roullette - it is not feasible to walk the stacktrace obtained by GetStackTrace
or GetAllStackTraces
JVMTI calls to create strong references for all classes referenced from all stack frames and therefore, when the time comes to actually write out the collected stacktraces in a human readable format there is a non-negligible chance that some of the required classes are already unloaded.
And since the JVMTI stack sampling profilers are not crashing all the time for J9, I am assuming that there is some kind of protection there already and what we are really hitting is just some corner case, similar to what is happening in the Hotspot.
@jbachorik We are working on a prototype that will keep jniIDs stable at the cost of leaking some memory when class unloading occurs. We want to understand if this approach is feasible by seeing how it works in real world use cases. Could you try our prototype and measure the footprint? Which platforms do you run on?
Long term, I think AGCTv2 will be the solution, however, it may be a while before it is released. In the meantime I dont see any solution that doesn't involve some kind of trade-off.
Note that the the referenced PR is not a prototype of anything - it does nothing to address invalid method IDs.
It's current purpose is to measure how bad the memory leak will be if we go ahead with this solution.
If we move forward with this, I don't think we'll return NULL from the JVMTI queries - we'll consider the IDs to be invalid and return the appropriate error code.
@gacholio , error code is fine. As long as JVM does not hard-crash.
As for how bad the memory leak is - I would assume it would be approximately as bad as in hotspot where also the jmethodID wrapper instances are leaking. I expect the leak to be ~8 bytes per method on a 64bit system (the jmethodID
type is a pointer to a pointer, right?).
It might, perhaps, be possible to employ some reference counting mechanism where eg. it would be upon the API user to 'touch' all jmethodID values in the returned stacktrace and then 'release' them once the stacktrace is no longer needed. TBH, I am not sure if it is worth the hassle, though.
With the current head stream code, you should be able to add -XX:+KeepJNIIDs
and get the desired behaviour.
Note that the size of an ID is 4 pointer slots (only 2 are used by method IDs, but they are allocated from a generic pool which also hosts field IDs).
Each pool structure is several allocations, the largest usually starting at 4K. The pool is per-ClassLoader, not per-Class.
@jbachorik Please verify that a current VM fixes this issue. I'll add the optimization from https://github.com/eclipse-openj9/openj9/pull/18684#issuecomment-1879233182 once we have confirmation.
@gacholio I can confirm that with #18684 the jmethodID usage by our profiler is stable. Thanks for fixing the problem!
Thanks for the verification - I'll proceed with the optimization shortly.
When using
jvmtiGetMethodDeclaringClass
to resolve a previously storedjmethodID
value and obtain the information about the declaring class the JVM will crash reliably.This is happening for Semeru 17.0.7, 11.0.19 but not 8.0.372.
The relevant part of the backtrace looks like
I am attaching the report files. report.zip