eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 722 forks source link

Loading AOT code based on dependencies #20529

Open cjjdespres opened 2 weeks ago

cjjdespres commented 2 weeks ago

The current approach to AOT loading involves setting a low initial invocation count (the scount, default 20) for methods with stored AOT bodies in the SCC. This invocation count guess cannot be too small, or loads would be attempted before all the classes necessary for relocation had been loaded, and it cannot be too large, or loads would be delayed unnecessarily.

An alternative is to induce an AOT load when every class necessary for relocating that AOT code has been loaded. This will both increase the reliability of AOT loads and reduce the time between when a load is possible and when a load actually occurs. (Dependency-based loading will not eliminate load failures entirely, as some runtime properties of the classes in question may have changed, but decreasing the failure rate further would require more substantial changes to the structure of relocation itself).

The plan I am following for an initial design has a few components.

  1. During an AOT compilation, the dependencies (as SCC offsets) of an AOT-compiled method will be collected and stored in the SCC, and a reference to them will be added to the header of the relocation records of the method, following the treatment of the well-known classes chain of an AOT method compiled with SVM. This will largely occur during J9::AheadOfTimeCompile::processRelocations().
  2. A map from SCC offsets to loaded classes with those offsets (with valid class chains) will be maintained, to be used to determine when method dependencies are satisfied. I have found that it is necessary to keep track of the initialization state of class dependencies (either loaded or fully-initialized), so additions to this map will be made in the class load and class initialization JIT hooks. Classes will be removed from this map on class unload or redefinition. Since the map needs to be accurate, the subclasses of redefined classes will need to be removed (if they were previously valid) or checked for validity again (if they were previously invalid), which will require the use of the persistent CH table.
  3. When methods with code in the SCC are initialized (in jitHookInitializeSendTarget()), their dependencies will be read from the SCC and the current state of those dependencies will be determined. If their dependencies are all satisfied, their initial count will be set to zero. Otherwise, they and their dependencies will be tracked, and their dependencies will be updated as updates to the map in (2) occur. When their dependencies are satisfied, their current count will be reduced to zero and they will be removed from tracking. This will trigger an AOT load the next time they are invoked.

Testing of a rough implementation with acmeair is promising - the AOT failure rate is significantly reduced, and AOT loads generally happen much earlier than the scount approach. I'll link PRs here as I open them to implement the different pieces of this design.

cjjdespres commented 2 weeks ago

Attn @mpirvu.

github-actions[bot] commented 2 weeks ago

Issue Number: 20529 Status: Open Recommended Components: comp:vm, comp:build, comp:test Recommended Assignees: tajila, gacholio, babsingh

cjjdespres commented 1 week ago

@mpirvu I did have some verbose log messages during development in a few places, but I haven't added them back in yet. Things related to offset and method tracking, registering dependencies during compilation, that sort of thing. Is there a good verbose option (or options) to put them under?

cjjdespres commented 1 week ago

Now that I've been able to do some testing of this again, I ran acmeair with -Xaot:forceAOT and found some issues with inlined method validation. One example is java/net/URLConnection.<init>(Ljava/net/URL;)V. During relocation, I see:

TR_RelocationRecordValidateDefiningClassFromCP 00007F999CA38CD4
        size 14 type 67 flags 2 reloFlags 0
        Flag: Wide offsets
        isStatic true
        classID 40
        beholderID 30
        cpindex 170
        reloLocation: from 00007F999CA38CE4 at 00007F9925CBE560 (offset ce39c620)
...
TR_RelocationRecordValidateStaticClassFromCP 00007F32CCA972D2
        size 14 type 68 flags 2 reloFlags 0
        Flag: Wide offsets
        classID 40
        beholderID 30
        cpindex 170
        reloLocation: from 00007F32CCA972E2 at 00007F32BCDD7D80 (offset 356a9f00)
        applyRelocationAtAllOffsets: rc = staticClassFromCPValidationFailure
relocateAOTCodeAndData: return code 47

Both classes (ID 30 and 40) are fully initialized, but the relocation failed because getClassOfStaticFromCP was used on the cp entry at index 170, and that entry was not yet resolved. There are a few intermediate relocation records with beholderID 30 and cpindex 170, but none of them called anything that would resolve the entry. Some entry resolution might be happening here, but only at the cp index of the method that got inlined and caused these records to be generated (index 171). The AOT load with dependency tracking happened at t=744, whereas the initial compilation and the AOT load without dependency tracking both happened at roughly t=1200, so I'm assuming the earlier load is what is causing this.

One thing I did notice is that calls to addDefiningClassFromCPRecord() in the SVM seem to be preceded by calls to jitCTResolveStaticFieldRefWithMethod(), and that function looks like it would be able to resolve the cp entry for the class itself (index 170 in this case) at compile time because of its eventual call to resolveStaticFieldRefInto(). That function is also used in TR_RelocationRecordDataAddress::applyRelocation(), so I'd imagine it's fine to use during the validation of one of these records, in case the entry isn't resolved yet - these functions just abort the resolution of the entry if J9_RESOLVE_FLAG_JIT_COMPILE_TIME is set and they encounter any problems.

I'll have to see if this works elsewhere. The number of load failures with -Xaot:forceAOT and dependency tracking is still lower than without dependency tracking (~100 versus ~350), but I think a lot of the dependency tracking failures are avoidable. Setting the invocation count higher than 0 when all the dependencies are satisfied might also help if jitCTResolveStaticFieldRefWithMethod() doesn't work, but for this particular method even a count=5 didn't seem to do anything.

(The AOT load failure total with a read-only SCC closer to what's created during open liberty container builds is roughly 10 with dependency tracking and 100 without, for comparison).

cjjdespres commented 6 days ago

After further investigation, I don't think jitCTResolveStaticFieldRefWithMethod updates what I want. I might be able to resolveClassRef() before getClassOfStaticFromCP(), though.

cjjdespres commented 3 days ago

I said in https://github.com/eclipse-openj9/openj9/issues/20669#issuecomment-2494747074 that it looked like the failure here (in https://github.com/eclipse-openj9/openj9/issues/20529#issuecomment-2484538171) was because of getClassFromSignature(), but I now see that it's this:

https://github.com/eclipse-openj9/openj9/blob/ff435327b1ea3f7ee9593545b0ab27c92673edae/runtime/compiler/runtime/SymbolValidationManager.cpp#L842-L845

that is not interacting well with some of the other validations. The compilation logs for the method with that relo failure look like this:

ClassByNameRecord
        _class=0x00000000000F0D00
        className=java/net/URL
        _classChainOffset=1289193
        _beholder=0x0000000000431900
        className=java/net/URLConnection
        kind=64
        id=30

         got fieldsig as Ljava/net/URL;
IlGenerator: Generating compressedRefs anchor [00007FA1CA2AF3F0] for node [00007FA1CA2AF350]
         got static signature as Z
VirtualMethodFromCPRecord
        _method=0x00000000000F0788       <---- this is java/net/URL.getProtocol()Ljava/lang/String;
        _beholder=0x0000000000431900
        className=java/net/URLConnection
        _cpIndex=36
        kind=87
        id=31
...
[     3] O^O INLINER: Inlining qwerty java/net/URL.getProtocol()Ljava/lang/String; into java/net/URLConnection.<init>(Ljava/net/URL;)V with a virtual guard kind=DirectMethodGuard type=NonoverriddenTest partialInline=0
addVirtualGuard 00007FA1CA37AC50, guardkind = 11, virtualGuardTestType 3, bc index 0, callee index 1, callNode 00007FA1CA32A000, guardNode 00007FA1CA32D7A0, currentInlinedSiteIdx 1
addVirtualGuard 00007FA1CA37AF50, guardkind = 8, virtualGuardTestType 3, bc index 0, callee index 1, callNode 00007FA1CA32A000, guardNode 00007FA1CA32D930, currentInlinedSiteIdx 1
        create storeNode 00007FA1CA32D980 of tempSymRef #447 (possibly for node uncommoning during opcodeExpansion)
        create storeNode 00007FA1CA32DA20 of tempSymRef #447 (possibly for node uncommoning during opcodeExpansion)
...
genIL() returned 1
            (Building alias info)
DefiningClassFromCPRecord
        _class=0x00000000001B1F00
        className=java/util/Locale
        _beholder=0x00000000000F0D00
        className=java/net/URL
        _cpIndex=170
        _isStatic=true
        kind=67
        id=40

ClassFromCPRecord
        _class=0x00000000001B1F00
        className=java/util/Locale
        _beholder=0x00000000000F0D00
        className=java/net/URL
        _cpIndex=171
        kind=66
        id=40
...
StaticClassFromCPRecord
        _class=0x00000000001B1F00
        className=java/util/Locale
        _beholder=0x00000000000F0D00
        className=java/net/URL
        _cpIndex=170
        kind=68
        id=40

I imagine that we used TR_ResolvedRelocatableJ9Method::getClassFromConstantPool() on the resolved method for java/net/URL.getProtocol()Ljava/lang/String;, which would then call TR_ResolvedRelocatableJ9Method::validateClassFromConstantPool, which resulted in the ClassByNameRecord, and only that record. The issue is that validating that record is done by validateClassByName, and that record doesn't touch any constant pool. In contrast, a ClassFromCPRecord will use:

https://github.com/eclipse-openj9/openj9/blob/ff435327b1ea3f7ee9593545b0ab27c92673edae/runtime/compiler/runtime/SymbolValidationManager.cpp#L1273

and that getClassFromCP ends up calling resolveClassRef. That function definitely looks like it can update the ramClassRefWrapper in a constant pool at compile time, provided that everything is loaded already and has the correct properties. You can see that one was created for cpIndex=171, which should be constantPool->romConstantPool[170]->classRefCPIndex.

The issue with all of this is that at compile time, we would have called TR_ResolvedJ9Method::getClassFromConstantPool to get the class at cp index 170. Since that method uses resolveClassRef on the constant pool and index directly, it is capable of setting that entry at compile time. The validate* functions for ClassByNameRecord and all of the other records involving cp index 170 in this particular compilation avoid calling resolveClassRef on exactly index 170 in the constant pool of java/net/URL, so at relocation time we never tried to update this entry, even though we should have been able to do so.

I think we can just add a ClassFromCPRecord immediately after that ClassByNameRecord in addClassFromCPRecord and that will fix this family relocation failures that I've been seeing.