eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

Non-fatal assert in extended.functional with JITServer and AOT cache: Must have shared VM access #20056

Open cjjdespres opened 2 months ago

cjjdespres commented 2 months ago

The assert occurred with a debug build on JDK21 xlinux, with a manually-started JITServer on the side, with the client requesting the JITServer AOT cache be used, in the component MockitoMockTest_0 (the first component to be run in extended.functional, I believe). I still need to check whether or not this shows up without AOT cache or without JITServer entirely.

The assert is here:

https://github.com/eclipse-openj9/openj9/blob/e81a895c37ed86b70386aae57d93131a0736d29c/runtime/compiler/env/ClassLoaderTable.cpp#L192

From the core, jvmtiRetransformClasses was called, eventually getting to internalCreateRAMClassDone. That triggered jitHookClassLoadHelper, that called associateClassLoaderWithClass, at which point the assert fired. The class that was done being created was java/lang/Object.

I see where the change might have happened, actually. Here:

https://github.com/eclipse-openj9/openj9/blob/e81a895c37ed86b70386aae57d93131a0736d29c/runtime/jvmti/jvmtiClass.c#L1195-L1199

the safePoint variable should be non-zero (the J9_EXTENDED_RUNTIME_OSR_SAFE_POINT flag is set in vm->extendedRuntimeFlags), and the backtrace indicates we were at the recreateRAMClasses call just down from there. If you look at commit https://github.com/eclipse-openj9/openj9/commit/e83d9d1cf94666f0298f3057d24ed39b431dcf66, acquireSafePointVMAccess was changed so that the vm->exclusiveCount was set to 1 when acquiring safe point VM access, and this is how the code stands today. I don't know if that's a full explanation of why the assert is firing now - if we were ever not at a safePoint before that change, we'd acquire exclusive VM access and presumably trigger the same assert when running associateClassLoaderWithClass, but perhaps that code path is somehow impossible.

The assert was added in https://github.com/eclipse-openj9/openj9/pull/12154, but I couldn't see a discussion of why shared VM access was needed specifically here. Do you happen to remember why that was the case, @mpirvu?

cjjdespres commented 2 months ago

Or @dsouzai, if you'd happen to know if shared VM access is really required here. I would imagine not.

A similar assert exists in runtime/compiler/runtime/JITServerAOTDeserializer.cpp that is also firing, I assume for the same reason as the above assert.

mpirvu commented 2 months ago

I think that because associateClassLoaderWithClass is called by an application thread while doing class loading, it must have VM access.

cjjdespres commented 2 months ago

It does have VM access, if I look at the core. It just has exclusive VM access and not shared. I would have thought that exclusive VM access would be fine at this point, but the assert specifically wants shared VM access, so I was unsure.

cjjdespres commented 1 month ago

I think I understand now - the point of these asserts is to document how it is that the class loader table can work without additional locks. There are three ways to use the class loader table:

  1. Adding an entry to the table. Asserts that the caller must have shared VM access.
  2. Looking up an entry in the table. Asserts that the caller must have shared VM access or must "hold the classUnloadMonitor for reading".
  3. Removing a class loader from the table. Asserts that the caller must have exclusive VM access and must "hold the classUnloadMonitor for writing".

Only (1) and (2) can be allowed to happen concurrently. Uses (1) and (3) are mutually exclusive because shared and exclusive VM access are, and uses (2) and (3) are mutually exclusive because the class unload monitor can't be held in both those ways simultaneously, I assume (and also because shared and exclusive VM access are mutually exclusive). So this particular assert is intended to alert us that the assumption in (1) was violated, so (1) and (3) are no longer guaranteed not to happen concurrently.

I mentioned above that I thought that it was probably the safe point VM access change (making it count as a type of exclusive access) that caused this assert to start firing - when the assert was added, safe point VM access was a type of shared VM access. If we can detect that the current thread has safe VM access, and if we know that the current thread having safe point VM access means that no other thread can be attempting (3), then we can just modify this failing assert so it passes if the caller has either shared or safe point VM access. Otherwise something different will have to be done.