Open dmitry-ten opened 3 years ago
I have opened #12966 that addresses the bug but I'm not sure if it's the best approach to it.
A different thing that could be done is to make sure that reclaimAssumptions
is always passed metaData
, when it exists, instead of passing NULL when handling failures.
@mpirvu fyi.
@dsouzai, @fjeremic since you worked on RAT in the past, maybe you would have better suggestions?
This means that metadata->runtimeAssumptionList is not guaranteed to be NULL after the entire assumption list is reclaimed. This leads to crashes with JITServer, although the issue is not JITServer specific.
Question, what is looking at metadata->runtimeAssumptionList
after we reclaim assumptions? Grepping the code, most places will either pass in the metadata or will set metadata->runtimeAssumptionList
to NULL right after. The only place that doesn't is TR::CompilationInfoPerThreadBase::processException
, which only happens when we fail the compile, in which case the metadata is invalid anyway.
When we reclaim assumptions from processException
, we might be not holding the VM access, so if a GC cycle happens after the assumptions are reclaimed but before VM access is reacquired by the compilation thread, we can end up in a situation when metadata hasn't been invalidated yet and the assumption list is invalid.
Although I'm not 100% certain if that's what's causing crashes on JITServer client. Will try to collect more conclusive data on this. I know that the order in which we create metadata and add it to the classloader list of metadatas is slightly different for remote compilations so that could be affecting why we don't see these crashes without JITServer.
When we reclaim assumptions from processException, we might be not holding the VM access, so if a GC cycle happens after the assumptions are reclaimed but before VM access is reacquired by the compilation thread, we can end up in a situation when metadata hasn't been invalidated yet and the assumption list is invalid.
If a GC cycle occurs, in jitReleaseCodeCollectMetaData
we're only going to reclaim a successfully compiled body, so I'm not sure how it would get its hand on the metadata. Granted it doesn't look like we free the metadata from the AVL trees on compile failure so that's a memory leak.
I figured out when this scenario occurs.
In this code,
https://github.com/eclipse-openj9/openj9/blob/0385f07370413904765c93c6e93fa5c3f9be4006/runtime/compiler/control/JITClientCompilationThread.cpp#L3399-L3406
If a trampoline cannot be reserved on the client, compilation will fail, and the failure will be handled by processException
which will reclaim assumptions. However, at this point in the compilation, we already did relocations and added method's metadata to the tree which is cleaned up by GC. So, when the next GC cycle occurs, it will try to reclaim assumptions from metadata because runtimeAssumptionList
pointer wasn't set to NULL.
This explains why this error only happens on JITServer, since for non-JITServer resolved trampolines are reserved during compilation.
It's another question why trampoline reservation fails in the first place, but that's something to be addressed separately.
So, when the next GC cycle occurs, it will try to reclaim assumptions from metadata because runtimeAssumptionList pointer wasn't set to NULL.
Do we add the body to the faint cache block list when we fail the compile in JITServer? I guess my concern comes from the fact that in createMethodMetaData
we call artifactManager->insertArtifact
to add the metadata into the AVL trees. If a compile fails after this point, I guess I don't see that node getting removed unless we reclaim the code in HookHelpers. So there might be a problem there.
However, we're only going to reclaim bodies if we've added them to the faint cache block list, but that seems to only happen in jitReleaseCodeStackWalk
, which is a runtime thing. In the issue you're running into we haven't executed the body yet. In other words, we only reclaim a body when we decide to recompile it; if a compile fails, we're supposed to just release the reserved code cache. So I'm still not sure how the code that runs in a GC cycle could get its hand on the metadata of this (currently being compiled) body; even if the node is in the AVL tree, the infra won't remove it from the AVL tree until we process the faint cache blocks.
I could be missing some other code path though.
https://github.com/eclipse-openj9/openj9/blob/0a9c6ead004198c0bbf28d0fbc418d822624a45f/runtime/compiler/runtime/ClassUnloadAssumption.cpp#L516 This method is passed a double pointer to the sentinel runtime assumption location and a pointer to the metadata. When the method detects that all assumptions in a list were marked for detach, it will reset the sentinel pointer address to NULL. https://github.com/eclipse-openj9/openj9/blob/0a9c6ead004198c0bbf28d0fbc418d822624a45f/runtime/compiler/runtime/ClassUnloadAssumption.cpp#L559-L564 However, in many cases, the address of the sentinel pointer is the address inside compilation object, not the address inside the metadata. https://github.com/eclipse-openj9/openj9/blob/0a9c6ead004198c0bbf28d0fbc418d822624a45f/runtime/compiler/control/CompilationThread.cpp#L10743-L10748 This means that
metadata->runtimeAssumptionList
is not guaranteed to be NULL after the entire assumption list is reclaimed. This leads to crashes with JITServer, although the issue is not JITServer specific.Also note that the same piece of code has a memory leak, but I will address that in a separate issue.