RAT reclamation not resetting assumption list pointer inside metadata

eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Other

3.28k stars 720 forks source link

RAT reclamation not resetting assumption list pointer inside metadata #12967

Open dmitry-ten opened 3 years ago

dmitry-ten commented 3 years ago

https://github.com/eclipse-openj9/openj9/blob/0a9c6ead004198c0bbf28d0fbc418d822624a45f/runtime/compiler/runtime/ClassUnloadAssumption.cpp#L516 This method is passed a double pointer to the sentinel runtime assumption location and a pointer to the metadata. When the method detects that all assumptions in a list were marked for detach, it will reset the sentinel pointer address to NULL. https://github.com/eclipse-openj9/openj9/blob/0a9c6ead004198c0bbf28d0fbc418d822624a45f/runtime/compiler/runtime/ClassUnloadAssumption.cpp#L559-L564 However, in many cases, the address of the sentinel pointer is the address inside compilation object, not the address inside the metadata. https://github.com/eclipse-openj9/openj9/blob/0a9c6ead004198c0bbf28d0fbc418d822624a45f/runtime/compiler/control/CompilationThread.cpp#L10743-L10748 This means that metadata->runtimeAssumptionList is not guaranteed to be NULL after the entire assumption list is reclaimed. This leads to crashes with JITServer, although the issue is not JITServer specific.

Also note that the same piece of code has a memory leak, but I will address that in a separate issue.

dmitry-ten commented 3 years ago

I have opened #12966 that addresses the bug but I'm not sure if it's the best approach to it. A different thing that could be done is to make sure that reclaimAssumptions is always passed metaData, when it exists, instead of passing NULL when handling failures. @mpirvu fyi. @dsouzai, @fjeremic since you worked on RAT in the past, maybe you would have better suggestions?

dsouzai commented 3 years ago

This means that metadata->runtimeAssumptionList is not guaranteed to be NULL after the entire assumption list is reclaimed. This leads to crashes with JITServer, although the issue is not JITServer specific.

Question, what is looking at metadata->runtimeAssumptionList after we reclaim assumptions? Grepping the code, most places will either pass in the metadata or will set metadata->runtimeAssumptionList to NULL right after. The only place that doesn't is TR::CompilationInfoPerThreadBase::processException, which only happens when we fail the compile, in which case the metadata is invalid anyway.

dmitry-ten commented 3 years ago

When we reclaim assumptions from processException, we might be not holding the VM access, so if a GC cycle happens after the assumptions are reclaimed but before VM access is reacquired by the compilation thread, we can end up in a situation when metadata hasn't been invalidated yet and the assumption list is invalid. Although I'm not 100% certain if that's what's causing crashes on JITServer client. Will try to collect more conclusive data on this. I know that the order in which we create metadata and add it to the classloader list of metadatas is slightly different for remote compilations so that could be affecting why we don't see these crashes without JITServer.

dsouzai commented 3 years ago

When we reclaim assumptions from processException, we might be not holding the VM access, so if a GC cycle happens after the assumptions are reclaimed but before VM access is reacquired by the compilation thread, we can end up in a situation when metadata hasn't been invalidated yet and the assumption list is invalid.

If a GC cycle occurs, in jitReleaseCodeCollectMetaData we're only going to reclaim a successfully compiled body, so I'm not sure how it would get its hand on the metadata. Granted it doesn't look like we free the metadata from the AVL trees on compile failure so that's a memory leak.

dmitry-ten commented 3 years ago

I figured out when this scenario occurs. In this code, https://github.com/eclipse-openj9/openj9/blob/0385f07370413904765c93c6e93fa5c3f9be4006/runtime/compiler/control/JITClientCompilationThread.cpp#L3399-L3406 If a trampoline cannot be reserved on the client, compilation will fail, and the failure will be handled by processException which will reclaim assumptions. However, at this point in the compilation, we already did relocations and added method's metadata to the tree which is cleaned up by GC. So, when the next GC cycle occurs, it will try to reclaim assumptions from metadata because runtimeAssumptionList pointer wasn't set to NULL. This explains why this error only happens on JITServer, since for non-JITServer resolved trampolines are reserved during compilation.

It's another question why trampoline reservation fails in the first place, but that's something to be addressed separately.

dsouzai commented 3 years ago

So, when the next GC cycle occurs, it will try to reclaim assumptions from metadata because runtimeAssumptionList pointer wasn't set to NULL.

Do we add the body to the faint cache block list when we fail the compile in JITServer? I guess my concern comes from the fact that in createMethodMetaData we call artifactManager->insertArtifact to add the metadata into the AVL trees. If a compile fails after this point, I guess I don't see that node getting removed unless we reclaim the code in HookHelpers. So there might be a problem there.

However, we're only going to reclaim bodies if we've added them to the faint cache block list, but that seems to only happen in jitReleaseCodeStackWalk, which is a runtime thing. In the issue you're running into we haven't executed the body yet. In other words, we only reclaim a body when we decide to recompile it; if a compile fails, we're supposed to just release the reserved code cache. So I'm still not sure how the code that runs in a GC cycle could get its hand on the metadata of this (currently being compiled) body; even if the node is in the AVL tree, the infra won't remove it from the AVL tree until we process the faint cache blocks.

I could be missing some other code path though.